- Okay. Can everyone hear me? Okay. Sorry for the delay; I had a bit of technical difficulty. Today was the first time I was trying to use my new Touch Bar MacBook Pro for presenting, and none of the adapters were working, so I had to switch laptops at the last minute. So, thanks. Sorry about that.

So, today is lecture 10. We're talking about recurrent neural networks. As usual, a couple of administrative notes. We're working hard on assignment one grading; those grades will probably be out sometime later today. Hopefully they can get out before the A2 deadline. That's what I'm hoping for. On a related note, assignment two is due today at 11:59 p.m. So, who's done with that already? About half of you guys. You'll remember I did warn you when the assignment went out that it was quite long and to start early, so you were warned about that. But hopefully you guys have some late days left.

Also, as another reminder, the midterm will be in class on Tuesday. If you look around the lecture hall, there are not enough seats in this room to seat all the enrolled students in the class, so we'll actually be holding the midterm in several other lecture halls across campus, and we'll be sending out more details on exactly where to go in the next couple of days.

Another bit of an announcement: we've been working on a fun bit of extra credit for you to play with that we're calling the training game. This is a cool browser-based experience where you can go in, interactively train neural networks, and tweak the hyperparameters during training. This should be a really cool interactive way for you to practice some of the hyperparameter tuning skills that we've been talking about over the last couple of lectures.
This is not required, but I think it will be a really useful experience for gaining a little more intuition into how some of these hyperparameters work for different types of datasets in practice. We're still working on getting all the bugs worked out of this setup, and we'll probably send out some more instructions on exactly how it will work in the next couple of days. Again, it's not required, but please do check it out. I think it'll be really fun and a really cool thing to play with, and we will give you a bit of extra credit if you end up working with it and doing a couple of runs. So we'll send out some more details about this soon, once we get all the bugs worked out.

As a reminder, last time we were talking about CNN architectures. We walked through the timeline of some of the various winners of the ImageNet classification challenge. The breakthrough result, as we saw, was the AlexNet architecture in 2012, an eight-layer convolutional network. It did amazingly well, kick-started this whole deep learning revolution in computer vision, and brought a lot of these models into the mainstream. Then we skipped ahead a couple of years and saw that in the 2014 ImageNet challenge we had these two really interesting models, VGG and GoogLeNet, which were much deeper. VGG had a 16- and a 19-layer model, and GoogLeNet was, I believe, a 22-layer model. One thing that is kind of interesting about these models is that the 2014 ImageNet challenge was right before batch normalization was invented. At that time, before the invention of batch normalization, training these relatively deep models of roughly twenty layers was very challenging.
In fact, both of these models had to resort to a little bit of hackery to get their deep models to converge. For VGG, they had the 16- and 19-layer models, but they actually first trained an 11-layer model, because that was what they could get to converge, then added some extra random layers in the middle and continued training the 16- and 19-layer models. Managing this training process was very challenging in 2014, before the invention of batch normalization. Similarly, for GoogLeNet, we saw that it has these auxiliary classifiers stuck into the lower layers of the network. These were not really needed to get good classification performance; they were just a way to inject extra gradient directly into the lower layers of the network. This, again, was before the invention of batch normalization, and once you have these networks with batch normalization, you no longer need these slightly ugly hacks to get the deeper models to converge.

Then in the 2015 ImageNet challenge we saw this really cool model called ResNet, these residual networks that have shortcut connections: in each little residual block, we take our input, pass it through the block's convolutional layers, and then add the input to the output of those convolutional layers. This is kind of a funny architecture, but it actually has two really nice properties. One is that if we just set all the weights in a residual block to zero, then that block is computing the identity. So, in some sense, it's relatively easy for this model to learn not to use the layers that it doesn't need.
In addition, it adds a nice interpretation to L2 regularization in the context of these networks. Remember, once you put L2 regularization on the weights of your network, it drives all the parameters towards zero. In a standard convolutional architecture, driving all the weights towards zero maybe doesn't make much sense. But in the context of a residual network, driving all the parameters towards zero encourages the model not to use layers it doesn't need, because it just drives those residual blocks towards the identity when they aren't needed for classification.

The other really useful property of these residual networks has to do with gradient flow in the backward pass. Remember what happens at these addition gates in the backward pass: when upstream gradient comes in through an addition gate, it splits and forks along the two different paths. So the upstream gradient takes one path through the convolutional blocks, but it also has a direct connection through the residual connection. When you imagine stacking many of these residual blocks on top of each other, so that the network ends up with potentially hundreds of layers, these residual connections give a sort of gradient superhighway for gradients to flow backward through the entire network. This allows the network to train much more easily and much faster, and actually allows these things to converge reasonably well even when the model is potentially hundreds of layers deep. This idea of managing gradient flow in your models is super important everywhere in machine learning, and it's super prevalent in recurrent networks as well, so we'll definitely revisit the idea of gradient flow later in today's lecture.
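As a minimal sketch of the two properties just described, here is a toy fully connected residual block in numpy. This is not the actual ResNet block, which uses convolutional layers with batch normalization; the function and variable names are purely illustrative.

```python
import numpy as np

def residual_block(x, W1, W2):
    # A toy fully connected "residual block": out = x + F(x).
    # If W1 and W2 are driven to zero (e.g. by L2 regularization),
    # F(x) becomes zero and the block computes the identity.
    f = np.maximum(0, x @ W1)   # first layer + ReLU
    f = f @ W2                  # second layer
    return x + f                # shortcut (addition) connection

# In the backward pass, the addition gate copies the upstream gradient
# onto both branches, so gradient reaches x directly through the
# shortcut regardless of what happens inside F.
```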
We also saw a couple of other more exotic, more recent CNN architectures last time, including DenseNet and FractalNet, and once you think about these architectures in terms of gradient flow, they make a little bit more sense. Things like DenseNet and FractalNet add additional shortcut or identity connections inside the model, and if you think about what happens in the backward pass for these models, these additional funny topologies are basically providing direct paths for gradients to flow from the loss at the end of the network more easily into all the different layers of the network. So again, this idea of managing gradient flow properly in your CNN architectures is something we've really seen a lot more of in the last couple of years, and we'll probably see more of it moving forward as more exotic architectures are invented.

We also saw this nice plot of performance versus the number of flops, the number of parameters, and the runtime of these various models, and there are some interesting characteristics you can dive in and see from that plot. One is that VGG and AlexNet have a huge number of parameters, and these parameters come almost entirely from the fully connected layers of the models. AlexNet has something like roughly 62 million parameters, and if you look at the first fully connected layer in AlexNet, it goes from an activation volume of six by six by 256 into a fully connected vector of 4096. So if you imagine what the weight matrix needs to look like at that layer, it's gigantic: its number of entries is six times six times 256 times 4096. If you multiply that out, you see that that single layer has 38 million parameters.
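As a quick sanity check on that figure, counting only the weights of that single layer (ignoring biases):

```python
# Weight count of AlexNet's large fully connected layer (ignoring biases):
fc_in = 6 * 6 * 256     # flattened 6x6x256 activation volume = 9,216 inputs
fc_out = 4096           # fully connected output units
print(fc_in * fc_out)   # 37,748,736, i.e. roughly 38 million parameters
```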
So more than half of the parameters of the entire AlexNet model are just sitting in that one fully connected layer. And if you add up all the parameters in the fully connected layers of AlexNet, including the other fully connected layers, you see that something like 59 of the 62 million parameters in AlexNet are sitting in these fully connected layers. When we move to other architectures, like GoogLeNet and ResNet, they do away with a lot of these large fully connected layers in favor of global average pooling at the end of the network, and this allows these nicer architectures to really cut down their parameter counts.

So that was our brief recap of the CNN architectures that we saw last lecture. Today, we're going to move to one of my favorite topics to talk about, which is recurrent neural networks. So far in this class, we've seen what I like to think of as a vanilla feed-forward network: all of our network architectures have this flavor, where we receive some input, and that input is a fixed-size object, like an image or a vector. That input is fed through some set of hidden layers and produces a single output, like a set of classification scores over a set of categories. But in some contexts in machine learning, we want more flexibility in the types of data that our models can process. Once we move to this idea of recurrent neural networks, we have a lot more opportunities to play around with the types of input and output data that our networks can handle. Once we have recurrent neural networks, we can do what we call one-to-many models, where maybe our input is some object of fixed size, like an image, but now our output is a sequence of variable length, such as a caption.
Different captions might have different numbers of words, so our output needs to be variable in length. We also might have many-to-one models, where our input could be variably sized. This might be something like a piece of text, and we want to say what the sentiment of that text is, whether it's positive or negative. Or in a computer vision context, you might imagine taking a video as input, and that video might have a variable number of frames; now we want to read this entire video of potentially variable length and then, at the end, make a classification decision about what kind of activity or action is going on in that video. We might also have problems where we want both the input and the output to be variable in length. We might see something like this in machine translation, where our input is maybe a sentence in English, which could have a variable length, and our output is maybe a sentence in French, which could also have a variable length. And crucially, the length of the English sentence might be different from the length of the French sentence, so we need models that have the capacity to accept variable-length sequences on both the input and the output. Finally, we might also consider problems where our input is variable in length, like a video sequence with a variable number of frames, and now we want to make a decision for each element of that input sequence. In the context of videos, that might mean making some classification decision for every frame of the video. Recurrent neural networks are this kind of general paradigm for handling variable-sized sequence data that allows us to pretty naturally capture all of these different setups in our models.

Recurrent neural networks are actually important even for some problems that have a fixed-size input and a fixed-size output.
Even there, they can still be pretty useful. In this example, we might want to do sequential processing of our input. Here, we're receiving a fixed-size input like an image, and we want to make a classification decision about what number is shown in the image. But rather than just doing a single feed-forward pass and making the decision all at once, this network is actually looking around the image and taking a series of glimpses of different parts of it, and after that series of glimpses it makes its final decision about which number is present. So here, even though our input was an image and our output was a classification decision, this idea of being able to handle variable-length processing with recurrent neural networks can lead to some really interesting types of models.

There's a really cool paper that I like that applied this same type of idea to generating new images, where we want the model to synthesize brand-new images that look kind of like the images it saw in training, and we can use a recurrent neural network architecture to actually paint these output images one piece at a time. You can see that, even though our output is a fixed-size image, we can have models that work over time to compute parts of the output sequentially, one at a time; we can use recurrent neural networks for that type of setup as well.

So after this pitch about all the cool things that RNNs can do, you might wonder what exactly these things are. In general, a recurrent neural network has this little recurrent core cell: it will take some input x, feed that input into the RNN, and the RNN has some internal hidden state, and that internal hidden state will be updated every time the RNN reads a new input.
That internal hidden state will then be fed back into the model the next time it reads an input. And frequently we will want our RNNs to also produce some output at every time step, so we'll have this pattern where the network reads an input, updates its hidden state, and then produces an output.

So then the question is: what is the functional form of this recurrence relation that we're computing? Inside this little green RNN block, we're computing some recurrence relation with a function f. This function f depends on some weights W. It accepts the previous hidden state, h_{t-1}, as well as the input at the current time step, x_t, and it outputs the next hidden state, or the updated hidden state, which we call h_t. Then as we read the next input, this new hidden state h_t will just be passed into the same function along with the next input, x_{t+1}. If we wanted to produce some output at every time step of this network, we might attach some additional fully connected layers that read in this h_t at every time step and make that decision based on the hidden state. One thing to note is that we use the same function f_W and the same weights W at every time step of the computation.

The simplest functional form that you can imagine is what we call the vanilla recurrent neural network. Here we have the same functional form from the previous slide, where we take in our previous hidden state and our current input and need to produce the next hidden state. The simplest thing you might imagine is that we have some weight matrix W_xh that we multiply against the input x_t, as well as another weight matrix W_hh that we multiply against the previous hidden state.
We make these two multiplications, add them together, and squash the result through a tanh, so we get some kind of nonlinearity in the system. You might be wondering why we use a tanh here and not some other type of nonlinearity, after all the negative things we've said about tanh in previous lectures; I think we'll return to that a little later on when we talk about more advanced architectures like the LSTM. Then, in addition, in this architecture, if we wanted to produce some y_t at every time step, you might have another weight matrix that accepts the hidden state and transforms it into some y, to produce maybe some class score predictions at every time step.

You can kind of think of recurrent neural networks in two ways. One is this concept of having a hidden state that feeds back into itself, recurrently, but I find that picture a little bit confusing. Sometimes I find it clearer to think about unrolling this computational graph over multiple time steps, and this makes the data flow of the hidden states, the inputs, the outputs, and the weights a little bit more clear. At the first time step, we'll have some initial hidden state h_0, which is usually initialized to zeros in most contexts, and we'll have some input x_1. This initial hidden state h_0 and our first input x_1 go into our f_W function, and this produces our next hidden state, h_1. Then we repeat this process when we receive the next input: now the current hidden state h_1 and the next input x_2 go into that same f_W to produce the next hidden state, h_2.
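As a minimal numpy sketch of this vanilla RNN step: the weight names follow the lecture's W_xh and W_hh, while the output matrix W_hy, the sizes, and the toy unrolling loop below are illustrative assumptions rather than the lecture's exact code.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # Vanilla RNN recurrence: h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    # Optional per-time-step output (e.g. unnormalized class scores)
    y_t = W_hy @ h_t
    return h_t, y_t

# Unrolling over a sequence: the SAME weights are reused at every step.
D, H, V = 10, 32, 4                     # input, hidden, output sizes (illustrative)
W_xh = np.random.randn(H, D) * 0.01
W_hh = np.random.randn(H, H) * 0.01
W_hy = np.random.randn(V, H) * 0.01
h = np.zeros(H)                         # h_0 is usually initialized to zeros
for x_t in np.random.randn(5, D):       # a toy sequence of 5 input vectors
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)
```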
This process repeats over and over again as we consume all of the inputs x_t in our sequence. One thing to note is that we can make this even more explicit and write the W matrix in our computational graph, and here you can see that we're reusing the same W matrix at every time step of the computation. So every one of these little f_W blocks receives a unique h and a unique x, but all of the blocks take the same W. And remember, when we talked about how gradient flows in backpropagation: when you reuse the same node multiple times in a computational graph, then during the backward pass you end up summing the gradients into the W matrix when you're computing dLoss/dW. So if you think about backpropagation for this model, you'll have a separate gradient for W flowing from each of those time steps, and the final gradient for W will be the sum of all of those individual per-time-step gradients.

We can also write the outputs y_t explicitly in this computational graph. The hidden state h_t at every time step might feed into some other little neural network that produces a y_t, which might be some class scores, or something like that, at every time step. We can also make the loss more explicit. In many cases, you might imagine that you have some ground truth label at every time step of your sequence, and then you'll compute some individual loss at every time step on these outputs y_t. This loss will frequently be something like a softmax loss, in the case where you have a ground truth label at every time step of the sequence.
Now the final loss for this entire training step will be the sum of these individual losses. We have a scalar loss at every time step, and we just sum them up to get our final scalar loss at the top of the network. If you think again about backpropagation through this thing: in order to train the model, we need to compute the gradient of the loss with respect to W. So we'll have gradient flowing from that final loss into each of these time steps, and then each of those time steps will compute a local gradient on the weights W, which will all be summed to give us our final gradient for the weights W.

Now, if we have this many-to-one situation, where maybe we want to do something like sentiment analysis, then we would typically make that decision based on the final hidden state of the network, because this final hidden state kind of summarizes all of the context from the entire sequence. Also, if we have a one-to-many situation, where we want to receive a fixed-size input and then produce a variably sized output, then you'll commonly use that fixed-size input to somehow initialize the initial hidden state of the model, and now the recurrent network will tick for each cell in the output; as you produce your variably sized output, you'll unroll the graph for each element of the output.

When we talk about sequence-to-sequence models, where you might do something like machine translation, taking a variably sized input to a variably sized output, you can think of this as a combination of the many-to-one plus the one-to-many. So we'll proceed in two stages, what we call an encoder and a decoder.
For the encoder, we'll receive the variably sized input, which might be your sentence in English, and then summarize that entire sentence using the final hidden state of the encoder network. Now we're in a many-to-one situation where we've summarized this entire variably sized input in a single vector, and we have a second decoder network, which is a one-to-many situation, that takes in that single vector summarizing the input sentence and produces a variably sized output, which might be your sentence in another language. In this variably sized output, we might make some predictions at every time step, maybe about what word to use. And you can imagine training this entire thing by unrolling the computational graph, summing the losses over the output sequence, and just performing backpropagation as usual.

As a bit of a concrete example, one thing that we frequently use recurrent neural networks for is this problem called language modeling. In the language modeling problem, we want our network to understand, in some sense, how to produce natural language. This might happen at the character level, where our model produces characters one at a time, or at the word level, where our model produces words one at a time. In a very simple example, you can imagine a character-level language model where the network reads some sequence of characters and then needs to predict what the next character will be in the stream of text. In this example, we have a very small vocabulary of four letters, h, e, l, and o, and we have this example training sequence of the word "hello": h, e, l, l, o.
During training, when we're training this language model, we will feed the characters of this training sequence as inputs; these are the x_t's that we feed in as the inputs to our recurrent neural network. Each of these inputs is a letter, and we need to figure out a way to represent letters in our network. What we'll typically do is figure out what our total vocabulary is; in this case, our vocabulary has four elements. Each letter will be represented by a vector that has zeros in every slot but one, and a one in the slot of the vocabulary corresponding to that letter. In this little example, since our vocabulary has the four letters h, e, l, o, the h in our input sequence is represented by a four-element vector with a one in the first slot and zeros in the other three slots, and we use the same sort of pattern to represent all the different letters in the input sequence.

Now, during the forward pass, at the first time step the network receives the input letter h. That goes into the first RNN cell, and then we produce this output y_t, which is the network making predictions, for each letter in the vocabulary, about which letter it thinks is most likely to come next. In this example, the correct output letter was e, because our training sequence was "hello", but the model is, I think, actually predicting o as the most likely letter. So in this case the prediction was wrong, and we would use a softmax loss to quantify our unhappiness with these predictions. At the next time step, we would feed in the second letter of the training sequence, e, and this process repeats.
We'll now represent e as a vector, use that input vector together with the previous hidden state to produce a new hidden state, and then use the second hidden state to again make predictions over every letter in the vocabulary. In this case, because our training sequence was "hello", after the letter e we want our model to predict l. Here our model may assign very low scores to the letter l, so we would incur a high loss. You repeat this process over and over, and if you train this model with many different sequences, then eventually it should learn how to predict the next character in a sequence based on the context of all the previous characters it has seen before.

Now, if you think about what happens at test time, after we train this model, one thing we might want to do with it is sample from the model, and actually use this trained neural network to synthesize new text that looks similar in spirit to the text it was trained on. The way this works is that we'll typically seed the model with some input prefix of text. In this case, the prefix is just the single letter h, and we'll feed that letter h through the first time step of our recurrent neural network. It will produce a distribution of scores over all the characters in the vocabulary. At test time, we'll actually sample from these scores: we'll use a softmax function to convert the scores into a probability distribution, and then we sample from that probability distribution to synthesize the second letter in the sequence. In this case, even though the scores were pretty bad, maybe we got lucky and sampled the letter e from this probability distribution. Now we'll take this letter e that was sampled from the distribution and feed it back as input into the network at the next time step.
We take this e, pull it down from the top, feed it back into the network as one of these one-hot vector representations, and then repeat the process in order to synthesize the next letter in the output. We can repeat this process over and over again to synthesize a new sequence using this trained model, where we're synthesizing the sequence one character at a time using these predicted probability distributions at each time step. Question?

Yeah, that's a great question. The question is: why might we sample instead of just taking the character with the largest score? In this case, because of the probability distribution that we had, it was impossible to get the right character, so we had to sample so the example could work out and make sense. But in practice, sometimes you'll see both. Sometimes you'll just take the argmax probability, and that will sometimes be a little bit more stable, but one advantage of sampling in general is that it lets you get diversity from your models. Sometimes you might have the same input, maybe the same prefix, or in the case of image captioning maybe the same image, but if you sample rather than taking the argmax, you'll see that these trained models are sometimes able to produce multiple different reasonable output sequences, depending on which samples they take at the first time steps. It's actually kind of a benefit, because we can now get more diversity in our outputs. Another question?

Could we feed in the softmax vector instead of the one-hot vector? You mean at test time? Yeah, so the question is: at test time, could we feed in this whole softmax vector rather than a one-hot vector? There are two problems with that. One is that it's very different from the data the model saw at training time.
In general, if you ask your model to do something at test time that is different from training time, it'll usually blow up: it'll usually give you garbage and you'll usually be sad. The other problem is that in practice our vocabularies might be very large. In this simple example, our vocabulary is only four elements, so it's not a big problem, but if you're thinking about generating words one at a time, now your vocabulary is every word in the English language, which could be something like tens of thousands of elements. So in practice, this first operation that takes in the one-hot vector is often performed using sparse vector operations rather than dense vectors. It would be computationally really bad if you wanted to feed in a dense 10,000-element softmax vector. So that's usually why we use a one-hot vector instead, even at test time.

This idea that we have a sequence, produce an output at every time step of the sequence, and then finally compute some loss is sometimes called backpropagation through time, because you're imagining that in the forward pass you're stepping forward through time, and then during the backward pass you're going backwards through time to compute all your gradients. This can actually be kind of problematic if you want to train on sequences that are very, very long. If you imagine that we were trying to train a neural network language model on maybe the entire text of Wikipedia, which is, by the way, something that people do pretty frequently, this would be super slow: every time we made a gradient step, we would have to make a forward pass through the entire text of Wikipedia, then make a backward pass through all of Wikipedia, and only then make a single gradient update. That would be super slow, and your model would never converge.
808 00:28:29,984 --> 00:28:31,853 It would also take a ridiculous amount of memory 809 00:28:31,853 --> 00:28:34,172 so this would be just really bad. 810 00:28:34,172 --> 00:28:36,384 In practice, what people do is this, sort of, 811 00:28:36,384 --> 00:28:39,933 approximation called truncated backpropagation through time. 812 00:28:39,933 --> 00:28:41,522 Here, the idea is that, 813 00:28:41,522 --> 00:28:43,973 even though our input sequence is very, very long, 814 00:28:43,973 --> 00:28:45,562 and even potentially infinite, 815 00:28:45,562 --> 00:28:47,184 what we'll do is that during, 816 00:28:47,184 --> 00:28:48,521 when we're training the model, 817 00:28:48,521 --> 00:28:50,544 we'll step forward for some number of steps, 818 00:28:50,544 --> 00:28:54,402 maybe like a hundred is kind of a ballpark number 819 00:28:54,402 --> 00:28:56,232 that people frequently use, 820 00:28:56,232 --> 00:28:58,623 and we'll step forward for maybe a hundred steps, 821 00:28:58,623 --> 00:29:01,944 compute a loss only over this sub sequence of the data, 822 00:29:01,944 --> 00:29:04,312 and then back propagate through this sub sequence, 823 00:29:04,312 --> 00:29:06,261 and now make a gradient step. 824 00:29:06,261 --> 00:29:08,102 And now, when we repeat, well, 825 00:29:08,102 --> 00:29:10,072 we still have these hidden states 826 00:29:10,072 --> 00:29:12,064 that we computed from the first batch, 827 00:29:12,064 --> 00:29:14,664 and now, when we compute this next batch of data, 828 00:29:14,664 --> 00:29:17,933 we will carry those hidden states forward in time, 829 00:29:17,933 --> 00:29:20,631 so the forward pass will be exactly the same. 830 00:29:20,631 --> 00:29:23,184 But now when we compute a gradient step 831 00:29:23,184 --> 00:29:24,392 for this next batch of data, 832 00:29:24,392 --> 00:29:28,124 we will only backpropagate again through this second batch. 833 00:29:28,124 --> 00:29:29,659 Now, we'll make a gradient step 834 00:29:29,659 --> 00:29:32,760 based on this truncated backpropagation through time. 835 00:29:32,760 --> 00:29:34,392 This process will continue, 836 00:29:34,392 --> 00:29:36,440 where now when we make the next batch, 837 00:29:36,440 --> 00:29:38,250 we'll again copy these hidden states forward, 838 00:29:38,250 --> 00:29:40,930 but then step forward and then step backward, 839 00:29:40,930 --> 00:29:43,840 but only for some small number of time steps. 840 00:29:43,840 --> 00:29:44,673 So this is, 841 00:29:44,673 --> 00:29:45,600 you can kind of think of this 842 00:29:45,600 --> 00:29:48,250 as being analogous to stochastic gradient descent 843 00:29:48,250 --> 00:29:49,872 in the case of sequences. 844 00:29:49,872 --> 00:29:52,280 Remember, when we talked about training our models 845 00:29:52,280 --> 00:29:53,690 on large data sets, 846 00:29:53,690 --> 00:29:54,672 then these data sets, 847 00:29:54,672 --> 00:29:56,880 it would be super expensive to compute the gradients 848 00:29:56,880 --> 00:29:58,720 over every element in the data set. 849 00:29:58,720 --> 00:30:00,861 So instead, we kind of take small samples, 850 00:30:00,861 --> 00:30:02,520 small mini batches instead, 851 00:30:02,520 --> 00:30:05,290 and use mini batches of data to compute gradient steps 852 00:30:05,290 --> 00:30:08,053 in, kind of, the image classification case. 853 00:30:08,053 --> 00:30:08,886 Question? 854 00:30:12,441 --> 00:30:14,415 Is this kind of, the question is, 855 00:30:14,415 --> 00:30:15,989 is this kind of making the Markov assumption?
856 00:30:15,989 --> 00:30:17,239 No, not really. 857 00:30:18,100 --> 00:30:20,001 Because we're carrying this hidden state forward 858 00:30:20,001 --> 00:30:21,442 in time forever. 859 00:30:21,442 --> 00:30:23,512 It's making a Markovian assumption 860 00:30:23,512 --> 00:30:25,792 in the sense that, conditioned on the hidden state, 861 00:30:25,792 --> 00:30:27,733 the hidden state is all that we need 862 00:30:27,733 --> 00:30:28,971 to predict the entire future 863 00:30:28,971 --> 00:30:31,101 of the sequence. 864 00:30:31,101 --> 00:30:33,019 But that assumption is kind of built 865 00:30:33,019 --> 00:30:35,012 into the recurrent neural network formula 866 00:30:35,012 --> 00:30:35,941 from the start. 867 00:30:35,941 --> 00:30:37,269 And that's not really particular 868 00:30:37,269 --> 00:30:39,032 to back propagation through time. 869 00:30:39,032 --> 00:30:40,372 Back propagation through time, 870 00:30:40,372 --> 00:30:43,061 or sorry, truncated back prop through time 871 00:30:43,061 --> 00:30:44,352 is just the way to approximate these gradients 872 00:30:44,352 --> 00:30:47,072 without making a backwards pass 873 00:30:47,072 --> 00:30:51,239 through your potentially very large sequence of data. 874 00:30:52,677 --> 00:30:55,301 This all sounds very complicated and confusing 875 00:30:55,301 --> 00:30:56,871 and it sounds like a lot of code to write, 876 00:30:56,871 --> 00:30:59,649 but in fact, this can actually be pretty concise. 877 00:30:59,649 --> 00:31:03,418 Andrej has this example of what he calls min-char-rnn, 878 00:31:03,418 --> 00:31:04,816 that does all of this stuff 879 00:31:04,816 --> 00:31:07,474 in just like 112 lines of Python. 880 00:31:07,474 --> 00:31:09,045 It handles building the vocabulary. 881 00:31:09,045 --> 00:31:09,925 It trains the model 882 00:31:09,925 --> 00:31:11,725 with truncated back propagation through time. 883 00:31:11,725 --> 00:31:13,936 And then, it can actually sample from that model 884 00:31:13,936 --> 00:31:16,584 in actually not too much code. 885 00:31:16,584 --> 00:31:17,925 So even though this sounds like 886 00:31:17,925 --> 00:31:18,965 kind of a big, scary process, 887 00:31:18,965 --> 00:31:20,862 it's actually not too difficult. 888 00:31:20,862 --> 00:31:22,314 I'd encourage you, if you're confused, 889 00:31:22,314 --> 00:31:23,456 to maybe go check this out 890 00:31:23,456 --> 00:31:25,125 and step through the code on your own time, 891 00:31:25,125 --> 00:31:27,074 and see, kind of, all of these concrete steps 892 00:31:27,074 --> 00:31:27,954 happening in code. 893 00:31:27,954 --> 00:31:30,184 So this is all in just a single file, 894 00:31:30,184 --> 00:31:31,723 all using numpy with no dependencies. 895 00:31:31,723 --> 00:31:34,473 It's relatively easy to read. 896 00:31:35,584 --> 00:31:36,896 So then, once we have this idea 897 00:31:36,896 --> 00:31:39,056 of training a recurrent neural network language model, 898 00:31:39,056 --> 00:31:41,593 we can actually have a lot of fun with this. 899 00:31:41,593 --> 00:31:43,984 And we can take in, sort of, any text that we want. 900 00:31:43,984 --> 00:31:46,424 Take in, like, whatever random text you can think of 901 00:31:46,424 --> 00:31:47,384 from the internet, 902 00:31:47,384 --> 00:31:49,656 train our recurrent neural network language model 903 00:31:49,656 --> 00:31:50,616 on this text, 904 00:31:50,616 --> 00:31:52,304 and then generate new text.
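A minimal numpy sketch of the test-time sampling loop described above, in the spirit of min-char-rnn (the sizes, the random weights, and the chunking helpers in the comment at the bottom are illustrative assumptions, not taken from the actual file):

    import numpy as np

    vocab_size, hidden_size = 4, 100
    Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input to hidden
    Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
    Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden to character scores
    bh, by = np.zeros((hidden_size, 1)), np.zeros((vocab_size, 1))

    def sample(h, seed_ix, n):
        # Generate n characters, feeding each sampled character back in as a one-hot vector.
        x = np.zeros((vocab_size, 1))
        x[seed_ix] = 1
        ixes = []
        for _ in range(n):
            h = np.tanh(Wxh @ x + Whh @ h + bh)              # vanilla RNN recurrence
            y = Why @ h + by                                 # scores over the vocabulary
            p = np.exp(y) / np.sum(np.exp(y))                # softmax probabilities
            ix = np.random.choice(vocab_size, p=p.ravel())   # sample (instead of argmax)
            x = np.zeros((vocab_size, 1))
            x[ix] = 1                                        # one-hot, fed back at the next step
            ixes.append(ix)
        return ixes

    print(sample(np.zeros((hidden_size, 1)), seed_ix=0, n=20))

    # Truncated backpropagation through time, schematically: walk over the text in
    # chunks of roughly 100 characters, carry the hidden state forward between chunks,
    # but only run the forward and backward pass within the current chunk.
    #
    #   hprev = np.zeros((hidden_size, 1))
    #   for inputs, targets in chunks_of(text, 100):              # chunks_of is hypothetical
    #       loss, grads, hprev = forward_backward(inputs, targets, hprev)
    #       apply_update(grads)                                   # e.g. Adagrad in min-char-rnn

Taking the argmax of p instead of calling np.random.choice gives the more deterministic behavior mentioned in the earlier answer, at the cost of diversity.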
905 00:31:52,304 --> 00:31:55,502 So in this example, we took this entire text 906 00:31:55,502 --> 00:31:57,394 of all of Shakespeare's works, 907 00:31:57,394 --> 00:31:59,314 and then used that to train 908 00:31:59,314 --> 00:32:00,885 a recurrent neural network language model 909 00:32:00,885 --> 00:32:02,634 on all of Shakespeare. 910 00:32:02,634 --> 00:32:03,896 And you can see that at the beginning of training, 911 00:32:03,896 --> 00:32:05,773 it's kind of producing maybe random gibberish garbage, 912 00:32:05,773 --> 00:32:08,216 but throughout the course of training, 913 00:32:08,216 --> 00:32:11,584 it ends up producing things that seem relatively reasonable. 914 00:32:11,584 --> 00:32:12,784 And after you've, 915 00:32:12,784 --> 00:32:14,272 after this model has been trained pretty well, 916 00:32:14,272 --> 00:32:16,104 then it produces text that seems, 917 00:32:16,104 --> 00:32:18,224 kind of, Shakespeare-esque to me. 918 00:32:18,224 --> 00:32:20,264 "Why do what that day," replied, 919 00:32:20,264 --> 00:32:23,184 whatever, right, you can read this. 920 00:32:23,184 --> 00:32:24,754 Like, it kind of looks kind of like Shakespeare. 921 00:32:24,754 --> 00:32:27,184 And if you actually train this model even more, 922 00:32:27,184 --> 00:32:28,765 and let it converge even further, 923 00:32:28,765 --> 00:32:31,264 and then sample these even longer sequences, 924 00:32:31,264 --> 00:32:33,904 you can see that it learns all kinds of crazy cool stuff 925 00:32:33,904 --> 00:32:35,864 that really looks like a Shakespeare play. 926 00:32:35,864 --> 00:32:38,856 It knows that it uses, maybe, these headings 927 00:32:38,856 --> 00:32:40,016 to say who's speaking. 928 00:32:40,016 --> 00:32:42,224 Then it produces these bits of text 929 00:32:42,224 --> 00:32:43,736 that have crazy dialogue 930 00:32:43,736 --> 00:32:45,565 that sounds kind of Shakespeare-esque. 931 00:32:45,565 --> 00:32:46,714 It knows to put line breaks 932 00:32:46,714 --> 00:32:47,744 in between these different things. 933 00:32:47,744 --> 00:32:49,342 And this is all, like, really cool, 934 00:32:49,342 --> 00:32:52,958 all just sort of learned from the structure of the data. 935 00:32:52,958 --> 00:32:55,536 We can actually get even crazier than this. 936 00:32:55,536 --> 00:32:58,368 This was one of my favorite examples. 937 00:32:58,368 --> 00:33:00,443 I found online, there's this. 938 00:33:00,443 --> 00:33:02,725 Is anyone a mathematician in this room? 939 00:33:02,725 --> 00:33:05,914 Has anyone taken an algebraic topology course by any chance? 940 00:33:05,914 --> 00:33:07,325 Wow, a couple, that's impressive. 941 00:33:07,325 --> 00:33:10,765 So you probably know more algebraic topology than me, 942 00:33:10,765 --> 00:33:12,165 but I found this open source 943 00:33:12,165 --> 00:33:15,114 algebraic topology textbook online. 944 00:33:15,114 --> 00:33:16,514 It's just a whole bunch of TeX files 945 00:33:16,514 --> 00:33:19,554 that are like this super dense mathematics. 946 00:33:19,554 --> 00:33:22,593 And LaTeX, 'cause LaTeX is sort of this, 947 00:33:22,593 --> 00:33:24,774 lets you write equations and diagrams 948 00:33:24,774 --> 00:33:26,325 and everything just using plain text. 949 00:33:26,325 --> 00:33:27,455 We can actually train our 950 00:33:27,455 --> 00:33:28,816 recurrent neural network language model 951 00:33:28,816 --> 00:33:31,457 on the raw LaTeX source code 952 00:33:31,457 --> 00:33:33,146 of this algebraic topology textbook.
953 00:33:33,146 --> 00:33:37,687 And if we do that, then after we sample from the model, 954 00:33:37,687 --> 00:33:39,208 then we get something that seems like, 955 00:33:39,208 --> 00:33:41,446 kind of like algebraic topology. 956 00:33:41,446 --> 00:33:43,797 So it knows to like put equations. 957 00:33:43,797 --> 00:33:46,679 It puts all kinds of crazy stuff. 958 00:33:46,679 --> 00:33:48,717 It's like, to prove study, 959 00:33:48,717 --> 00:33:51,574 we see that F sub U is a covering of x prime, 960 00:33:51,574 --> 00:33:52,773 blah, blah, blah, blah, blah. 961 00:33:52,773 --> 00:33:53,992 It knows where to put unions. 962 00:33:53,992 --> 00:33:56,576 It knows to put squares at the end of proofs. 963 00:33:56,576 --> 00:33:57,528 It makes lemmas. 964 00:33:57,528 --> 00:34:00,026 It makes references to previous lemmas. 965 00:34:00,026 --> 00:34:01,225 Right, like we hear, like. 966 00:34:01,225 --> 00:34:03,781 It's namely a bi-lemma question. 967 00:34:03,781 --> 00:34:05,917 We see that R is geometrically something. 968 00:34:05,917 --> 00:34:08,417 So it's actually pretty crazy. 969 00:34:09,496 --> 00:34:12,913 It also sometimes tries to make diagrams. 970 00:34:14,239 --> 00:34:16,132 For those of you that have taken algebraic topology, 971 00:34:16,132 --> 00:34:17,322 you know that these commutative diagrams 972 00:34:17,322 --> 00:34:19,313 are kind of a thing that you work with a lot 973 00:34:19,313 --> 00:34:21,153 So it kind of got the general gist 974 00:34:21,154 --> 00:34:22,353 of how to make those diagrams, 975 00:34:22,353 --> 00:34:26,379 but they actually don't make any sense. 976 00:34:26,380 --> 00:34:28,130 And actually, 977 00:34:28,130 --> 00:34:29,440 one of my favorite examples here 978 00:34:29,440 --> 00:34:32,728 is that it sometimes omits proofs. 979 00:34:32,728 --> 00:34:33,788 So it'll sometimes say, 980 00:34:33,789 --> 00:34:35,418 it'll sometimes say something like 981 00:34:35,418 --> 00:34:39,695 theorem, blah, blah, blah, blah, blah, proof omitted. 982 00:34:39,695 --> 00:34:41,559 This thing kind of has gotten the gist 983 00:34:41,559 --> 00:34:45,392 of how some of these math textbooks look like. 984 00:34:47,831 --> 00:34:49,122 We can have a lot of fun with this. 985 00:34:49,123 --> 00:34:50,792 So we also tried training one of these models 986 00:34:50,792 --> 00:34:53,498 on the entire source code of the Linux kernel. 987 00:34:53,498 --> 00:34:55,791 'Cause again, this character level stuff 988 00:34:55,792 --> 00:34:56,801 that we can train on, 989 00:34:56,801 --> 00:34:58,630 And then, when we sample this, 990 00:34:58,630 --> 00:35:01,483 it acutally again looks like C source code. 991 00:35:01,483 --> 00:35:03,534 It knows how to write if statements. 992 00:35:03,534 --> 00:35:06,192 It has, like, pretty good code formatting skills. 993 00:35:06,192 --> 00:35:08,110 It knows to indent after these if statements. 994 00:35:08,110 --> 00:35:09,552 It knows to put curly braces. 995 00:35:09,552 --> 00:35:12,289 It actually even makes comments about some things 996 00:35:12,289 --> 00:35:15,052 that are usually nonsense. 997 00:35:15,052 --> 00:35:18,089 One problem with this model is that 998 00:35:18,089 --> 00:35:19,842 it knows how to declare variables. 999 00:35:19,842 --> 00:35:23,355 But it doesn't always use the variables that it declares. 1000 00:35:23,355 --> 00:35:24,414 And sometimes it tries 1001 00:35:24,414 --> 00:35:26,134 to use variables that haven't been declared. 
1002 00:35:26,134 --> 00:35:27,554 This wouldn't compile. 1003 00:35:27,554 --> 00:35:28,814 I would not recommend sending this 1004 00:35:28,814 --> 00:35:30,555 as a pull request to Linux. 1005 00:35:30,555 --> 00:35:34,606 This thing also figures out how to recite the GNU, 1006 00:35:34,606 --> 00:35:37,867 this GNU license character by character. 1007 00:35:37,867 --> 00:35:40,454 It kind of knows that you need to recite the GNU license 1008 00:35:40,454 --> 00:35:42,696 and after the license comes some includes, 1009 00:35:42,696 --> 00:35:44,962 then some other includes, then source code. 1010 00:35:44,962 --> 00:35:46,864 This thing has actually learned quite a lot 1011 00:35:46,864 --> 00:35:48,836 about the general structure of the data. 1012 00:35:48,836 --> 00:35:50,096 Where, again, during training, 1013 00:35:50,096 --> 00:35:51,518 all we asked this model to do 1014 00:35:51,518 --> 00:35:53,398 was try to predict the next character in the sequence. 1015 00:35:53,398 --> 00:35:55,406 We didn't tell it any of this structure, 1016 00:35:55,406 --> 00:35:57,534 but somehow, just through the course 1017 00:35:57,534 --> 00:35:58,856 of this training process, 1018 00:35:58,856 --> 00:36:01,438 it learned a lot about the latent structure 1019 00:36:01,438 --> 00:36:03,355 in the sequential data. 1020 00:36:05,808 --> 00:36:07,406 Yeah, so it knows how to write code. 1021 00:36:07,406 --> 00:36:10,246 It does a lot of cool stuff. 1022 00:36:10,246 --> 00:36:13,575 I had this paper with Andrej a couple years ago 1023 00:36:13,575 --> 00:36:14,966 where we trained a bunch of these models 1024 00:36:14,966 --> 00:36:17,176 and then we wanted to try to poke into the brains 1025 00:36:17,176 --> 00:36:18,009 of these models 1026 00:36:18,009 --> 00:36:19,376 and figure out like what are they doing 1027 00:36:19,376 --> 00:36:20,856 and why are they working. 1028 00:36:20,856 --> 00:36:22,027 So we saw, in our, 1029 00:36:22,027 --> 00:36:23,386 these recurrent neural networks 1030 00:36:23,386 --> 00:36:24,947 have this hidden vector which is, 1031 00:36:24,947 --> 00:36:28,165 maybe, some vector that's updated over every time step. 1032 00:36:28,165 --> 00:36:29,838 And then what we wanted to try to figure out is 1033 00:36:29,838 --> 00:36:32,158 could we find some elements of this vector 1034 00:36:32,158 --> 00:36:34,176 that have some semantic, interpretable meaning. 1035 00:36:34,176 --> 00:36:35,336 So what we did is 1036 00:36:35,336 --> 00:36:37,096 we trained a neural network language model, 1037 00:36:37,096 --> 00:36:38,507 one of these character level models 1038 00:36:38,507 --> 00:36:39,917 on one of these data sets, 1039 00:36:39,917 --> 00:36:42,965 and then we picked one of the elements in that hidden vector 1040 00:36:42,965 --> 00:36:45,406 and now we look at what is the value of that hidden vector 1041 00:36:45,406 --> 00:36:47,923 over the course of a sequence 1042 00:36:47,923 --> 00:36:49,926 to try to get some sense of maybe 1043 00:36:49,926 --> 00:36:52,206 what these different hidden states are looking for. 1044 00:36:52,206 --> 00:36:54,086 When you do this, a lot of them end up looking 1045 00:36:54,086 --> 00:36:56,406 kind of like random gibberish garbage.
1046 00:36:56,406 --> 00:36:57,736 So here again, what we've done, 1047 00:36:57,736 --> 00:36:59,806 is we've picked one element of that vector, 1048 00:36:59,806 --> 00:37:01,467 and now we run the sequence forward 1049 00:37:01,467 --> 00:37:02,686 through the trained model, 1050 00:37:02,686 --> 00:37:04,056 and now the color of each character 1051 00:37:04,056 --> 00:37:07,271 corresponds to the magnitude of that single 1052 00:37:07,271 --> 00:37:09,446 scaler element of the hidden vector at every time step 1053 00:37:09,446 --> 00:37:10,876 when it's reading the sequence. 1054 00:37:10,876 --> 00:37:12,006 So you can see that a lot 1055 00:37:12,006 --> 00:37:13,958 of the vectors in these hidden states 1056 00:37:13,958 --> 00:37:15,883 are kind of not very interpretable. 1057 00:37:15,883 --> 00:37:17,404 It seems like they're kind of doing some of this 1058 00:37:17,404 --> 00:37:19,158 low level language modeling 1059 00:37:19,158 --> 00:37:21,396 to figure out what character should come next. 1060 00:37:21,396 --> 00:37:23,438 But some of them end up quite nice. 1061 00:37:23,438 --> 00:37:26,375 So here we found this vector that is looking for quotes. 1062 00:37:26,375 --> 00:37:28,507 You can see that there's this one hidden element, 1063 00:37:28,507 --> 00:37:30,318 this one element in the vector, 1064 00:37:30,318 --> 00:37:32,486 that is off, off, off, off, off blue 1065 00:37:32,486 --> 00:37:33,827 and then once it hits a quote, 1066 00:37:33,827 --> 00:37:37,136 it turns on and remains on for the duration 1067 00:37:37,136 --> 00:37:38,303 of this quote. 1068 00:37:39,187 --> 00:37:41,025 And now when we hit the second quotation mark, 1069 00:37:41,025 --> 00:37:42,598 then that cell turns off. 1070 00:37:42,598 --> 00:37:44,958 So somehow, even though this model was only trained 1071 00:37:44,958 --> 00:37:46,606 to predict the next character in a sequence, 1072 00:37:46,606 --> 00:37:48,856 it somehow learned that a useful thing, 1073 00:37:48,856 --> 00:37:49,976 in order to do this, 1074 00:37:49,976 --> 00:37:54,222 might be to have some cell that's trying to detect quotes. 1075 00:37:54,222 --> 00:37:55,507 We also found this other cell 1076 00:37:55,507 --> 00:37:58,956 that is, looks like it's counting the number of characters 1077 00:37:58,956 --> 00:38:00,104 since a line break. 1078 00:38:00,104 --> 00:38:01,507 So you can see that at the beginning of each line, 1079 00:38:01,507 --> 00:38:04,176 this element starts off at zero. 1080 00:38:04,176 --> 00:38:05,536 Throughout the course of the line, 1081 00:38:05,536 --> 00:38:07,055 it's gradually more red, 1082 00:38:07,055 --> 00:38:08,486 so that value increases. 1083 00:38:08,486 --> 00:38:10,078 And then after the new line character, 1084 00:38:10,078 --> 00:38:11,976 it resets to zero. 1085 00:38:11,976 --> 00:38:13,356 So you can imagine that maybe this cell 1086 00:38:13,356 --> 00:38:15,336 is letting the network keep track 1087 00:38:15,336 --> 00:38:16,966 of when it needs to write 1088 00:38:16,966 --> 00:38:19,545 to produce these new line characters. 1089 00:38:19,545 --> 00:38:20,936 We also found some that, 1090 00:38:20,936 --> 00:38:22,987 when we trained on the linux source code, 1091 00:38:22,987 --> 00:38:25,307 we found some examples that are turning on 1092 00:38:25,307 --> 00:38:27,158 inside the conditions of if statements. 
1093 00:38:27,158 --> 00:38:29,216 So this maybe allows the network 1094 00:38:29,216 --> 00:38:32,364 to differentiate whether it's outside an if statement 1095 00:38:32,364 --> 00:38:33,838 or inside that condition, 1096 00:38:33,838 --> 00:38:35,806 which might help it model these sequences better. 1097 00:38:35,806 --> 00:38:38,976 We also found some that turn on in comments, 1098 00:38:38,976 --> 00:38:41,467 or some that seem like they're counting 1099 00:38:41,467 --> 00:38:44,765 the number of indentation levels. 1100 00:38:44,765 --> 00:38:46,298 This is all just really cool stuff 1101 00:38:46,298 --> 00:38:47,488 because it's saying that 1102 00:38:47,488 --> 00:38:49,630 even though we are only trying to train this model 1103 00:38:49,630 --> 00:38:50,758 to predict next characters, 1104 00:38:50,758 --> 00:38:52,229 it somehow ends up learning a lot 1105 00:38:52,229 --> 00:38:55,646 of useful structure about the input data. 1106 00:38:57,161 --> 00:38:59,307 One kind of thing that we often use, 1107 00:38:59,307 --> 00:39:01,718 so this is not really been computer vision so far, 1108 00:39:01,718 --> 00:39:03,878 and we need to pull this back to computer vision 1109 00:39:03,878 --> 00:39:05,528 since this is a vision class. 1110 00:39:05,528 --> 00:39:07,107 We've alluded many times to this 1111 00:39:07,107 --> 00:39:08,747 image captioning model 1112 00:39:08,747 --> 00:39:10,528 where we want to build models that can input 1113 00:39:10,528 --> 00:39:14,376 an image and then output a caption in natural language. 1114 00:39:14,376 --> 00:39:17,408 There were a bunch of papers a couple years ago 1115 00:39:17,408 --> 00:39:19,787 that all had relatively similar approaches. 1116 00:39:19,787 --> 00:39:22,776 But I'm showing the figure from the paper from our lab 1117 00:39:22,776 --> 00:39:25,026 in a totally un-biased way. 1118 00:39:26,876 --> 00:39:29,198 But, the idea here is that the caption 1119 00:39:29,198 --> 00:39:31,648 is this variably length sequence that we might, 1120 00:39:31,648 --> 00:39:34,198 the sequence might have different numbers 1121 00:39:34,198 --> 00:39:35,608 of words for different captions. 1122 00:39:35,608 --> 00:39:37,059 So this is a totally natural fit 1123 00:39:37,059 --> 00:39:39,947 for a recurrent neural network language model. 1124 00:39:39,947 --> 00:39:41,598 So then what this model looks like 1125 00:39:41,598 --> 00:39:43,878 is we have some convolutional network 1126 00:39:43,878 --> 00:39:44,958 which will input the, 1127 00:39:44,958 --> 00:39:46,786 which will take as input the image, 1128 00:39:46,786 --> 00:39:48,397 and we've seen a lot about how 1129 00:39:48,397 --> 00:39:50,379 convolution networks work at this point, 1130 00:39:50,379 --> 00:39:52,753 and that convolutional network will produce 1131 00:39:52,753 --> 00:39:54,569 a summary vector of the image 1132 00:39:54,569 --> 00:39:56,227 which will then feed into the first time step 1133 00:39:56,227 --> 00:39:58,928 of one of these recurrent neural network language models 1134 00:39:58,928 --> 00:40:02,888 which will then produce words of the caption one at a time. 1135 00:40:02,888 --> 00:40:04,945 So the way that this kind of works at test time 1136 00:40:04,945 --> 00:40:06,425 after the model is trained 1137 00:40:06,425 --> 00:40:07,847 looks almost exactly the same 1138 00:40:07,847 --> 00:40:09,665 as these character level language models 1139 00:40:09,665 --> 00:40:11,178 that we saw a little bit ago. 
1140 00:40:11,178 --> 00:40:12,699 We'll take our input image, 1141 00:40:12,699 --> 00:40:14,499 feed it through our convolutional network. 1142 00:40:14,499 --> 00:40:16,987 But now instead of taking the softmax scores 1143 00:40:16,987 --> 00:40:18,638 from an image net model, 1144 00:40:18,638 --> 00:40:22,997 we'll instead take this 4,096 dimensional vector 1145 00:40:22,997 --> 00:40:24,915 from the end of the model, 1146 00:40:24,915 --> 00:40:27,998 and we'll take that vector and use it 1147 00:40:29,038 --> 00:40:31,377 to summarize the whole content of the image. 1148 00:40:31,377 --> 00:40:34,208 Now, remember when we talked about RNN language models, 1149 00:40:34,208 --> 00:40:36,379 we said that we need to seed the language model 1150 00:40:36,379 --> 00:40:37,947 with that first initial input 1151 00:40:37,947 --> 00:40:39,528 to tell it to start generating text. 1152 00:40:39,528 --> 00:40:42,118 So in this case, we'll give it some special start token, 1153 00:40:42,118 --> 00:40:45,236 which is just saying, hey, this is the start of a sentence. 1154 00:40:45,236 --> 00:40:46,728 Please start generating some text 1155 00:40:46,728 --> 00:40:49,006 conditioned on this image information. 1156 00:40:49,006 --> 00:40:52,338 So now previously, we saw that in this RNN language model, 1157 00:40:52,338 --> 00:40:55,328 we had these matrices that were taking the previous, 1158 00:40:55,328 --> 00:40:57,107 the input at the current time step 1159 00:40:57,107 --> 00:40:58,558 and the hidden state of the previous time step 1160 00:40:58,558 --> 00:41:01,240 and combining those to get the next hidden state. 1161 00:41:01,240 --> 00:41:04,678 Well now, we also need to add in this image information. 1162 00:41:04,678 --> 00:41:06,667 So one way, people play around with exactly 1163 00:41:06,667 --> 00:41:09,288 different ways to incorporate this image information, 1164 00:41:09,288 --> 00:41:10,688 but one simple way is just to add 1165 00:41:10,688 --> 00:41:13,105 a third weight matrix that is 1166 00:41:14,561 --> 00:41:16,779 adding in this image information at every time step 1167 00:41:16,779 --> 00:41:18,598 to compute the next hidden state. 1168 00:41:18,598 --> 00:41:21,145 So now, we'll compute this distribution 1169 00:41:21,145 --> 00:41:23,635 over all scores in our vocabulary 1170 00:41:23,635 --> 00:41:25,056 and here, our vocabulary is something 1171 00:41:25,056 --> 00:41:25,907 like all English words, 1172 00:41:25,907 --> 00:41:27,307 so it could be pretty large. 1173 00:41:27,307 --> 00:41:28,699 We'll sample from that distribution 1174 00:41:28,699 --> 00:41:32,099 and now pass that word back as input at the next time step. 1175 00:41:32,099 --> 00:41:34,488 And that will then feed that word in, 1176 00:41:34,488 --> 00:41:37,418 again get a distribution over all words in the vocab, 1177 00:41:37,418 --> 00:41:39,546 and again sample to produce the next word. 1178 00:41:39,546 --> 00:41:42,248 So then, after that thing is all done, 1179 00:41:42,248 --> 00:41:44,477 we'll maybe generate, 1180 00:41:44,477 --> 00:41:47,538 we'll generate this complete sentence. 1181 00:41:47,538 --> 00:41:50,956 We stop generation once we sample the special end token, 1182 00:41:50,956 --> 00:41:52,887 which kind of corresponds to the period 1183 00:41:52,887 --> 00:41:54,347 at the end of the sentence.
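A rough sketch of the image-conditioned recurrence just described, where a third weight matrix injects the CNN's image feature at every time step (the names, sizes, and the START and END token indices here are illustrative assumptions, not the exact formulation from the paper):

    import numpy as np

    vocab_size, hidden_size, feat_size = 1000, 512, 4096
    Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # previous word -> hidden
    Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # previous hidden -> hidden
    Wih = np.random.randn(hidden_size, feat_size) * 0.01    # image feature -> hidden (the third matrix)
    Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> word scores
    START, END = 0, 1                                        # assumed special token indices

    def caption(v, max_len=20):
        # v is the 4,096-dimensional image feature from the CNN, shape (feat_size, 1).
        h = np.zeros((hidden_size, 1))
        word, out = START, []
        for _ in range(max_len):
            x = np.zeros((vocab_size, 1))
            x[word] = 1                                      # one-hot previous word
            h = np.tanh(Wxh @ x + Whh @ h + Wih @ v)         # image information added every step
            scores = Why @ h
            p = np.exp(scores) / np.sum(np.exp(scores))      # distribution over the vocabulary
            word = int(np.random.choice(vocab_size, p=p.ravel()))
            if word == END:                                  # stop once the end token is sampled
                break
            out.append(word)
        return out

    print(caption(np.random.randn(feat_size, 1)))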
1184 00:41:54,347 --> 00:41:56,419 Then once the network samples this ends token, 1185 00:41:56,419 --> 00:41:57,859 we stop generation and we're done 1186 00:41:57,859 --> 00:42:00,328 and we've gotten our caption for this image. 1187 00:42:00,328 --> 00:42:01,798 And now, during training, 1188 00:42:01,798 --> 00:42:03,419 we trained this thing to generate, 1189 00:42:03,419 --> 00:42:04,939 like we put an end token at the end 1190 00:42:04,939 --> 00:42:06,739 of every caption during training 1191 00:42:06,739 --> 00:42:08,408 so that the network kind of learned during training 1192 00:42:08,408 --> 00:42:10,907 that end tokens come at the end of sequences. 1193 00:42:10,907 --> 00:42:12,958 So then, during test time, 1194 00:42:12,958 --> 00:42:14,739 it tends to sample these end tokens 1195 00:42:14,739 --> 00:42:16,848 once it's done generating. 1196 00:42:16,848 --> 00:42:18,457 So we trained this model 1197 00:42:18,457 --> 00:42:20,139 in kind of a completely supervised way. 1198 00:42:20,139 --> 00:42:21,547 You can find data sets 1199 00:42:21,547 --> 00:42:24,763 that have images together with natural language captions. 1200 00:42:24,763 --> 00:42:26,805 Microsoft COCO is probably the biggest 1201 00:42:26,805 --> 00:42:28,666 and most widely used for this task. 1202 00:42:28,666 --> 00:42:30,191 But you can just train this model 1203 00:42:30,191 --> 00:42:31,883 in a purely supervised way. 1204 00:42:31,883 --> 00:42:34,935 And then backpropagate through to jointly train 1205 00:42:34,935 --> 00:42:37,653 both this recurrent neural network language model 1206 00:42:37,653 --> 00:42:38,635 and then also pass gradients 1207 00:42:38,635 --> 00:42:40,846 back into this final layer of this the CNN 1208 00:42:40,846 --> 00:42:42,396 and additionally update the weights 1209 00:42:42,396 --> 00:42:45,580 of the CNN to jointly tune all parts of the model 1210 00:42:45,580 --> 00:42:47,837 to perform this task. 1211 00:42:47,837 --> 00:42:49,528 Once you train these models, 1212 00:42:49,528 --> 00:42:52,438 they actually do some pretty reasonable things. 1213 00:42:52,438 --> 00:42:54,997 These are some real results from a model, 1214 00:42:54,997 --> 00:42:56,689 from one of these trained models, 1215 00:42:56,689 --> 00:42:58,226 and it says things like a cat sitting 1216 00:42:58,226 --> 00:43:00,995 on a suitcase on the floor, 1217 00:43:00,995 --> 00:43:02,583 which is pretty impressive. 1218 00:43:02,583 --> 00:43:04,975 It knows about cats sitting on a tree branch, 1219 00:43:04,975 --> 00:43:06,499 which is also pretty cool. 1220 00:43:06,499 --> 00:43:07,796 It knows about two people walking 1221 00:43:07,796 --> 00:43:09,631 on the beach with surfboards. 1222 00:43:09,631 --> 00:43:11,755 So these models are actually pretty powerful 1223 00:43:11,755 --> 00:43:14,642 and can produce relatively complex captions 1224 00:43:14,642 --> 00:43:16,635 to describe the image. 1225 00:43:16,635 --> 00:43:17,630 But that being said, 1226 00:43:17,630 --> 00:43:19,808 these models are really not perfect. 1227 00:43:19,808 --> 00:43:21,227 They're not magical. 1228 00:43:21,227 --> 00:43:24,319 Just like any machine learning model, 1229 00:43:24,319 --> 00:43:25,629 if you try to run them on data 1230 00:43:25,629 --> 00:43:27,458 that was very different from the training data, 1231 00:43:27,458 --> 00:43:28,830 they don't work very well. 1232 00:43:28,830 --> 00:43:30,364 So for example, this example, 1233 00:43:30,364 --> 00:43:33,317 it says a woman is holding a cat in her hand. 
1234 00:43:33,317 --> 00:43:35,421 There's clearly no cat in the image. 1235 00:43:35,421 --> 00:43:36,863 But she is wearing a fur coat, 1236 00:43:36,863 --> 00:43:38,106 and maybe the texture of that coat 1237 00:43:38,106 --> 00:43:41,359 kind of looked like a cat to the model. 1238 00:43:41,359 --> 00:43:42,903 Over here, we see a woman standing on a beach 1239 00:43:42,903 --> 00:43:44,379 holding a surfboard. 1240 00:43:44,379 --> 00:43:46,744 Well, she's definitely not holding a surfboard 1241 00:43:46,744 --> 00:43:47,892 and she's doing a handstand, 1242 00:43:47,892 --> 00:43:49,909 which is maybe the interesting part of that image, 1243 00:43:49,909 --> 00:43:51,820 and the model totally missed that. 1244 00:43:51,820 --> 00:43:53,623 Also, over here, we see this example 1245 00:43:53,623 --> 00:43:56,137 where there's this picture of a spider web 1246 00:43:56,137 --> 00:43:57,168 in the tree branch, 1247 00:43:57,168 --> 00:43:58,238 and it totally, 1248 00:43:58,238 --> 00:43:59,071 and it says something like 1249 00:43:59,071 --> 00:44:00,382 a bird sitting on a tree branch. 1250 00:44:00,382 --> 00:44:01,802 So it totally missed the spider, 1251 00:44:01,802 --> 00:44:03,144 but during training, 1252 00:44:03,144 --> 00:44:05,139 it never really saw examples of spiders. 1253 00:44:05,139 --> 00:44:06,722 It just knows that birds sit 1254 00:44:06,722 --> 00:44:08,217 on tree branches during training. 1255 00:44:08,217 --> 00:44:10,152 So it kind of makes these reasonable mistakes. 1256 00:44:10,152 --> 00:44:10,985 Or here at the bottom, 1257 00:44:10,985 --> 00:44:12,155 it can't really tell the difference 1258 00:44:12,155 --> 00:44:14,338 between this guy throwing and catching the ball, 1259 00:44:14,338 --> 00:44:15,712 but it does know that it's a baseball player 1260 00:44:15,712 --> 00:44:17,709 and there's balls and things involved. 1261 00:44:17,709 --> 00:44:19,347 So again, just want to say that these models 1262 00:44:19,347 --> 00:44:20,441 are not perfect. 1263 00:44:20,441 --> 00:44:23,313 They work pretty well when you ask them to caption images 1264 00:44:23,313 --> 00:44:25,109 that were similar to the training data, 1265 00:44:25,109 --> 00:44:26,711 but they definitely have a hard time 1266 00:44:26,711 --> 00:44:29,188 generalizing far beyond that. 1267 00:44:29,188 --> 00:44:30,560 So another thing you'll sometimes see 1268 00:44:30,560 --> 00:44:34,152 is this slightly more advanced model called Attention, 1269 00:44:34,152 --> 00:44:36,447 where now when we're generating the words of this caption, 1270 00:44:36,447 --> 00:44:39,120 we can allow the model to steer it's attention 1271 00:44:39,120 --> 00:44:41,036 to different parts of the image. 1272 00:44:41,036 --> 00:44:44,670 And I don't want to spend too much time on this. 1273 00:44:44,670 --> 00:44:46,859 But the general way that this works is that 1274 00:44:46,859 --> 00:44:48,925 now our convolutional network, 1275 00:44:48,925 --> 00:44:50,664 rather than producing a single vector 1276 00:44:50,664 --> 00:44:52,227 summarizing the entire image, 1277 00:44:52,227 --> 00:44:54,274 now it produces some grid of vectors 1278 00:44:54,274 --> 00:44:55,683 that summarize the, 1279 00:44:55,683 --> 00:44:58,503 that give maybe one vector for each spatial location 1280 00:44:58,503 --> 00:44:59,954 in the image. 
1281 00:44:59,954 --> 00:45:00,787 And now, when we, 1282 00:45:00,787 --> 00:45:03,304 when this model runs forward, 1283 00:45:03,304 --> 00:45:06,618 in addition to sampling the vocabulary at every time step, 1284 00:45:06,618 --> 00:45:08,523 it also produces a distribution 1285 00:45:08,523 --> 00:45:09,964 over the locations in the image 1286 00:45:09,964 --> 00:45:11,263 where it wants to look. 1287 00:45:11,263 --> 00:45:13,581 And now this distribution over image locations 1288 00:45:13,581 --> 00:45:15,561 can be seen as a kind of a tension 1289 00:45:15,561 --> 00:45:18,276 of where the model should look during training. 1290 00:45:18,276 --> 00:45:19,397 So now that first hidden state 1291 00:45:19,397 --> 00:45:23,155 computes this distribution over image locations, 1292 00:45:23,155 --> 00:45:26,316 which then goes back to the set of vectors 1293 00:45:26,316 --> 00:45:27,832 to give a single summary vector 1294 00:45:27,832 --> 00:45:31,372 that maybe focuses the attention on one part of that image. 1295 00:45:31,372 --> 00:45:32,795 And now that summary vector gets fed, 1296 00:45:32,795 --> 00:45:34,422 as an additional input, 1297 00:45:34,422 --> 00:45:36,508 at the next time step of the neural network. 1298 00:45:36,508 --> 00:45:38,278 And now again, it will produce two outputs. 1299 00:45:38,278 --> 00:45:40,414 One is our distribution over vocabulary words. 1300 00:45:40,414 --> 00:45:43,714 And the other is a distribution over image locations. 1301 00:45:43,714 --> 00:45:45,402 This whole process will continue, 1302 00:45:45,402 --> 00:45:47,577 and it will sort of do these two different things 1303 00:45:47,577 --> 00:45:49,160 at every time step. 1304 00:45:50,296 --> 00:45:51,390 And after you train the model, 1305 00:45:51,390 --> 00:45:53,905 then you can see that it kind of 1306 00:45:53,905 --> 00:45:56,104 will shift it's attention around the image 1307 00:45:56,104 --> 00:45:58,298 for every word that it generates in the caption. 1308 00:45:58,298 --> 00:45:59,827 Here you can see that it 1309 00:45:59,827 --> 00:46:02,723 produced the caption, a bird is flying over, 1310 00:46:02,723 --> 00:46:04,436 I can't see that far. 1311 00:46:04,436 --> 00:46:06,474 But you can see that its attention 1312 00:46:06,474 --> 00:46:08,556 is shifting around different parts of the image 1313 00:46:08,556 --> 00:46:12,473 for each word in the caption that it generates. 1314 00:46:14,112 --> 00:46:15,417 There's this notion of hard attention 1315 00:46:15,417 --> 00:46:16,544 versus soft attention, 1316 00:46:16,544 --> 00:46:19,147 which I don't really want to get into too much, 1317 00:46:19,147 --> 00:46:21,434 but with this idea of soft attention, 1318 00:46:21,434 --> 00:46:24,405 we're kind of taking a weighted combination 1319 00:46:24,405 --> 00:46:26,614 of all features from all image locations, 1320 00:46:26,614 --> 00:46:28,275 whereas in the hard attention case, 1321 00:46:28,275 --> 00:46:31,004 we're forcing the model to select exactly one location 1322 00:46:31,004 --> 00:46:34,259 to look at in the image at each time step. 
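A minimal sketch of the soft attention step just described: the hidden state scores each spatial location, a softmax turns those scores into a distribution over locations, and the context vector is a weighted sum of the features (the grid size and the simple dot-product scoring are illustrative choices, not the exact formulation from any particular paper):

    import numpy as np

    L, D, H = 49, 512, 512                      # 7x7 grid of D-dim features, hidden size H
    features = np.random.randn(L, D)            # one feature vector per spatial location
    h = np.random.randn(H)                      # current hidden state of the RNN
    W_att = np.random.randn(D, H) * 0.01        # simple linear scoring function

    scores = features @ (W_att @ h)             # one score per location, shape (L,)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                             # attention weights: a distribution over locations
    context = a @ features                      # soft attention: weighted sum of all features

    # Hard attention would instead pick a single location, e.g. np.random.choice(L, p=a),
    # which is not differentiable, hence the fancier training mentioned above.
    print(context.shape, round(a.sum(), 3))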
1323 00:46:34,259 --> 00:46:35,123 So the hard attention case 1324 00:46:35,123 --> 00:46:36,946 where we're selecting exactly one image location 1325 00:46:36,946 --> 00:46:38,277 is a little bit tricky 1326 00:46:38,277 --> 00:46:41,270 because that is not really a differentiable function, 1327 00:46:41,270 --> 00:46:42,774 so you need to do something slightly fancier 1328 00:46:42,774 --> 00:46:44,292 than vanilla backpropagation 1329 00:46:44,292 --> 00:46:46,247 in order to just train the model in that scenario. 1330 00:46:46,247 --> 00:46:48,145 And I think we'll talk about that a little bit later 1331 00:46:48,145 --> 00:46:51,498 in the lecture on reinforcement learning. 1332 00:46:51,498 --> 00:46:53,577 Now, when you look at after you train 1333 00:46:53,577 --> 00:46:54,750 one of these attention models 1334 00:46:54,750 --> 00:46:57,415 and then run it on to generate captions, 1335 00:46:57,415 --> 00:46:59,326 you can see that it tends to focus it's attention 1336 00:46:59,326 --> 00:47:01,980 on maybe the salient or semanticly meaningful part 1337 00:47:01,980 --> 00:47:04,067 of the image when generating captions. 1338 00:47:04,067 --> 00:47:06,419 You can see that the caption was a woman 1339 00:47:06,419 --> 00:47:08,413 is throwing a frisbee in a park 1340 00:47:08,413 --> 00:47:10,257 and you can see that this attention mask, 1341 00:47:10,257 --> 00:47:11,637 when it generated the word, 1342 00:47:11,637 --> 00:47:13,648 when the model generated the word frisbee, 1343 00:47:13,648 --> 00:47:14,516 at the same time, 1344 00:47:14,516 --> 00:47:17,403 it was focusing it's attention on this image region 1345 00:47:17,403 --> 00:47:18,976 that actually contains the frisbee. 1346 00:47:18,976 --> 00:47:20,644 This is actually really cool. 1347 00:47:20,644 --> 00:47:23,410 We did not tell the model where it should be looking 1348 00:47:23,410 --> 00:47:24,542 at every time step. 1349 00:47:24,542 --> 00:47:26,552 It sort of figured all that out for itself 1350 00:47:26,552 --> 00:47:27,955 during the training process. 1351 00:47:27,955 --> 00:47:29,716 Because somehow, it figured out that looking 1352 00:47:29,716 --> 00:47:31,540 at that image region was the right thing to do 1353 00:47:31,540 --> 00:47:32,721 for this image. 1354 00:47:32,721 --> 00:47:34,610 And because everything in this model is differentiable, 1355 00:47:34,610 --> 00:47:35,998 because we can backpropagate 1356 00:47:35,998 --> 00:47:38,161 through all these soft attention steps, 1357 00:47:38,161 --> 00:47:39,592 all of this soft attention stuff 1358 00:47:39,592 --> 00:47:42,541 just comes out through the training process. 1359 00:47:42,541 --> 00:47:45,297 So that's really, really cool. 1360 00:47:45,297 --> 00:47:47,345 By the way, this idea of recurrent neural networks 1361 00:47:47,345 --> 00:47:49,975 and attention actually gets used in other tasks 1362 00:47:49,975 --> 00:47:51,565 beyond image captioning. 1363 00:47:51,565 --> 00:47:53,169 One recent example is this idea 1364 00:47:53,169 --> 00:47:54,691 of visual question answering. 1365 00:47:54,691 --> 00:47:58,195 So here, our model is going to take two things as input. 1366 00:47:58,195 --> 00:47:59,259 It's going to take an image 1367 00:47:59,259 --> 00:48:01,608 and it will also take a natural language question 1368 00:48:01,608 --> 00:48:04,010 that's asking some question about the image. 
1369 00:48:04,010 --> 00:48:06,122 Here, we might see this image on the left 1370 00:48:06,122 --> 00:48:07,536 and we might ask the question, 1371 00:48:07,536 --> 00:48:10,163 what endangered animal is featured on the truck? 1372 00:48:10,163 --> 00:48:11,915 And now the model needs to select from one 1373 00:48:11,915 --> 00:48:13,835 of these four natural language answers 1374 00:48:13,835 --> 00:48:17,327 about which of these answers correctly answers that question 1375 00:48:17,327 --> 00:48:19,027 in the context of the image. 1376 00:48:19,027 --> 00:48:21,586 So you can imagine kind of stitching this model together 1377 00:48:21,586 --> 00:48:24,774 using CNNs and RNNs in kind of a natural way. 1378 00:48:24,774 --> 00:48:26,274 Now, we're in this 1379 00:48:28,125 --> 00:48:29,701 many to one scenario, 1380 00:48:29,701 --> 00:48:31,978 where now our model needs to take as input 1381 00:48:31,978 --> 00:48:33,566 this natural language sequence, 1382 00:48:33,566 --> 00:48:35,552 so we can imagine running a recurrent neural network 1383 00:48:35,552 --> 00:48:37,485 over each element of that input question, 1384 00:48:37,485 --> 00:48:40,111 to now summarize the input question in a single vector. 1385 00:48:40,111 --> 00:48:43,928 And then we can have a CNN to again summarize the image, 1386 00:48:43,928 --> 00:48:46,012 and now combine both the vector from the CNN 1387 00:48:46,012 --> 00:48:49,531 and the vector from the question and coding RNN 1388 00:48:49,531 --> 00:48:51,645 to then predict a distribution over answers. 1389 00:48:51,645 --> 00:48:52,786 We also sometimes, 1390 00:48:52,786 --> 00:48:54,416 you'll also sometimes see this idea 1391 00:48:54,416 --> 00:48:56,275 of soft spacial attention being incorporated 1392 00:48:56,275 --> 00:48:59,175 into things like visual question answering. 1393 00:48:59,175 --> 00:49:00,394 So you can see that here, 1394 00:49:00,394 --> 00:49:03,956 this model is also having the spatial attention 1395 00:49:03,956 --> 00:49:05,342 over the image when it's trying 1396 00:49:05,342 --> 00:49:07,062 to determine answers to the questions. 1397 00:49:07,062 --> 00:49:09,062 Just to, yeah, question? 1398 00:49:18,715 --> 00:49:19,548 So the question is 1399 00:49:19,548 --> 00:49:20,878 How are the different inputs combined? 1400 00:49:20,878 --> 00:49:23,116 Do you mean like the encoded question vector 1401 00:49:23,116 --> 00:49:25,998 and the encoded image vector? 1402 00:49:25,998 --> 00:49:27,188 Yeah, so the question is 1403 00:49:27,188 --> 00:49:28,445 how are the encoded image 1404 00:49:28,445 --> 00:49:30,395 and the encoded question vector combined? 1405 00:49:30,395 --> 00:49:31,756 Kind of the simplest thing to do 1406 00:49:31,756 --> 00:49:32,885 is just to concatenate them 1407 00:49:32,885 --> 00:49:34,847 and stick them into fully connected layers. 1408 00:49:34,847 --> 00:49:35,883 That's probably the most common 1409 00:49:35,883 --> 00:49:37,864 and that's probably the first thing to try. 1410 00:49:37,864 --> 00:49:39,396 Sometimes people do slightly fancier things 1411 00:49:39,396 --> 00:49:41,673 where they might try to have multiplicative interactions 1412 00:49:41,673 --> 00:49:42,885 between those two vectors 1413 00:49:42,885 --> 00:49:44,389 to allow a more powerful function. 1414 00:49:44,389 --> 00:49:46,467 But generally, concatenation is kind of a good 1415 00:49:46,467 --> 00:49:48,050 first thing to try. 
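A small sketch of that simplest option for visual question answering: concatenate the CNN's image vector with the RNN's question vector, then push the result through fully connected layers to score the candidate answers (sizes and layer widths are made up for illustration):

    import numpy as np

    img_feat = np.random.randn(4096)                 # image summary from the CNN
    q_feat = np.random.randn(512)                    # question summary from the encoding RNN
    num_answers = 4

    x = np.concatenate([img_feat, q_feat])           # the simple "just concatenate" combination
    W1 = np.random.randn(1024, x.shape[0]) * 0.01
    W2 = np.random.randn(num_answers, 1024) * 0.01
    hidden = np.maximum(0, W1 @ x)                   # fully connected layer + ReLU
    scores = W2 @ hidden                             # one score per candidate answer
    probs = np.exp(scores - scores.max())
    probs = probs / probs.sum()                      # distribution over the answers
    print(probs)

    # The fancier alternative mentioned above swaps the concatenation for
    # multiplicative interactions, e.g. an element-wise product of projected vectors.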
1416 00:49:49,426 --> 00:49:51,727 Okay, so now we've talked about a bunch of scenarios 1417 00:49:51,727 --> 00:49:54,083 where RNNs are used for different kinds of problems. 1418 00:49:54,083 --> 00:49:55,084 And I think it's super cool 1419 00:49:55,084 --> 00:49:56,758 because it allows you to 1420 00:49:56,758 --> 00:49:58,855 start tackling really complicated problems 1421 00:49:58,855 --> 00:50:02,386 combining images and computer vision 1422 00:50:02,386 --> 00:50:04,072 with natural language processing. 1423 00:50:04,072 --> 00:50:05,794 And you can see that we can kind of stith together 1424 00:50:05,794 --> 00:50:06,776 these models like Lego blocks 1425 00:50:06,776 --> 00:50:08,870 and attack really complicated things, 1426 00:50:08,870 --> 00:50:11,369 Like image captioning or visual question answering 1427 00:50:11,369 --> 00:50:14,069 just by stitching together these relatively simple 1428 00:50:14,069 --> 00:50:16,736 types of neural network modules. 1429 00:50:18,708 --> 00:50:20,861 But I'd also like to mention that 1430 00:50:20,861 --> 00:50:22,359 so far, we've talked about this idea 1431 00:50:22,359 --> 00:50:24,060 of a single recurrent network layer, 1432 00:50:24,060 --> 00:50:26,274 where we have sort of one hidden state, 1433 00:50:26,274 --> 00:50:29,404 and another thing that you'll see pretty commonly 1434 00:50:29,404 --> 00:50:32,581 is this idea of a multilayer recurrent neural network. 1435 00:50:32,581 --> 00:50:36,577 Here, this is a three layer recurrent neural network, 1436 00:50:36,577 --> 00:50:38,535 so now our input goes in, 1437 00:50:38,535 --> 00:50:42,292 goes into, goes in and produces a sequence of hidden states 1438 00:50:42,292 --> 00:50:44,554 from the first recurrent neural network layer. 1439 00:50:44,554 --> 00:50:46,309 And now, after we run kind of 1440 00:50:46,309 --> 00:50:47,893 one recurrent neural network layer, 1441 00:50:47,893 --> 00:50:50,468 then we have this whole sequence of hidden states. 1442 00:50:50,468 --> 00:50:52,991 And now, we can use the sequence of hidden states 1443 00:50:52,991 --> 00:50:54,733 as an input sequence to another 1444 00:50:54,733 --> 00:50:56,474 recurrent neural network layer. 1445 00:50:56,474 --> 00:50:57,801 And then you can just imagine, 1446 00:50:57,801 --> 00:50:59,757 which will then produce another sequence of hidden states 1447 00:50:59,757 --> 00:51:01,577 from the second RNN layer. 1448 00:51:01,577 --> 00:51:02,410 And then you can just imagine 1449 00:51:02,410 --> 00:51:04,181 stacking these things on top of each other, 1450 00:51:04,181 --> 00:51:06,108 cause we know that we've seen in other contexts 1451 00:51:06,108 --> 00:51:08,253 that deeper models tend to perform better 1452 00:51:08,253 --> 00:51:09,398 for various problems. 1453 00:51:09,398 --> 00:51:11,851 And the same kind of holds in RNNs as well. 1454 00:51:11,851 --> 00:51:14,377 For many problems, you'll see maybe a two 1455 00:51:14,377 --> 00:51:16,714 or three layer recurrent neural network model 1456 00:51:16,714 --> 00:51:18,215 is pretty commonly used. 1457 00:51:18,215 --> 00:51:22,086 You typically don't see super deep models in RNNs. 1458 00:51:22,086 --> 00:51:25,449 So generally, like two, three, four layer RNNs 1459 00:51:25,449 --> 00:51:29,327 is maybe as deep as you'll typically go. 
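A minimal sketch of the stacking idea: the sequence of hidden states produced by one recurrent layer becomes the input sequence to the next layer (three layers here, with made-up sizes):

    import numpy as np

    T, D, H, num_layers = 10, 64, 128, 3
    xs = [np.random.randn(D, 1) for _ in range(T)]             # the input sequence

    def make_layer(in_dim, hid_dim):
        return {"Wxh": np.random.randn(hid_dim, in_dim) * 0.01,
                "Whh": np.random.randn(hid_dim, hid_dim) * 0.01,
                "b": np.zeros((hid_dim, 1))}

    layers = [make_layer(D, H)] + [make_layer(H, H) for _ in range(num_layers - 1)]

    seq = xs
    for layer in layers:                                        # one recurrent layer at a time
        h = np.zeros((H, 1))
        out = []
        for x in seq:                                           # step through time
            h = np.tanh(layer["Wxh"] @ x + layer["Whh"] @ h + layer["b"])
            out.append(h)
        seq = out                                               # hidden states feed the next layer
    print(len(seq), seq[0].shape)                               # T hidden states from the top layer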
1460 00:51:29,327 --> 00:51:32,231 Then, I think it's also really interesting and important 1461 00:51:32,231 --> 00:51:33,500 to think about, 1462 00:51:33,500 --> 00:51:34,714 now we've seen kind of 1463 00:51:34,714 --> 00:51:38,222 what kinds of problems these RNNs can be used for, 1464 00:51:38,222 --> 00:51:40,079 but then you need to think a little bit more carefully 1465 00:51:40,079 --> 00:51:41,795 about exactly what happens to these models 1466 00:51:41,795 --> 00:51:43,229 when we try to train them. 1467 00:51:43,229 --> 00:51:45,704 So here, I've drawn this little vanilla RNN cell 1468 00:51:45,704 --> 00:51:47,447 that we've talked about so far. 1469 00:51:47,447 --> 00:51:50,639 So here, we're taking our current input, x t, 1470 00:51:50,639 --> 00:51:52,971 and our previous hidden state, h t minus one, 1471 00:51:52,971 --> 00:51:55,307 and then we stack, those are two vectors. 1472 00:51:55,307 --> 00:51:57,359 So we can just stack them together. 1473 00:51:57,359 --> 00:51:59,033 And then perform this matrix multiplication 1474 00:51:59,033 --> 00:52:00,431 with our weight matrix, 1475 00:52:00,431 --> 00:52:01,911 to give our, 1476 00:52:01,911 --> 00:52:03,798 and then squash that output through a tanh, 1477 00:52:03,798 --> 00:52:05,783 and that will give us our next hidden state. 1478 00:52:05,783 --> 00:52:07,687 And that's kind of the basic functional form 1479 00:52:07,687 --> 00:52:10,131 of this vanilla recurrent neural network. 1480 00:52:10,131 --> 00:52:11,347 But then, we need to think about 1481 00:52:11,347 --> 00:52:14,080 what happens in this architecture 1482 00:52:14,080 --> 00:52:17,738 during the backward pass when we try to compute gradients? 1483 00:52:17,738 --> 00:52:19,435 So then if we think about trying to compute, 1484 00:52:19,435 --> 00:52:21,426 so then during the backwards pass, 1485 00:52:21,426 --> 00:52:24,759 we'll receive the derivative of our h t, 1486 00:52:25,850 --> 00:52:27,075 we'll receive derivative of loss 1487 00:52:27,075 --> 00:52:28,568 with respect to h t. 1488 00:52:28,568 --> 00:52:30,885 And during the backward pass through the cell, 1489 00:52:30,885 --> 00:52:32,720 we'll need to compute derivative of loss 1490 00:52:32,720 --> 00:52:34,632 to the respect of h t minus one. 1491 00:52:34,632 --> 00:52:37,026 Then, when we compute this backward pass, 1492 00:52:37,026 --> 00:52:38,601 we see that the gradient flows backward 1493 00:52:38,601 --> 00:52:39,881 through this red path. 1494 00:52:39,881 --> 00:52:41,706 So first, that gradient will flow backwards 1495 00:52:41,706 --> 00:52:43,068 through this tanh gate, 1496 00:52:43,068 --> 00:52:44,036 and then it will flow backwards 1497 00:52:44,036 --> 00:52:45,958 through this matrix multiplication gate. 1498 00:52:45,958 --> 00:52:48,644 And then, as we've seen in the homework 1499 00:52:48,644 --> 00:52:52,054 and when implementing these matrix multiplication layers, 1500 00:52:52,054 --> 00:52:53,252 when you backpropagate through 1501 00:52:53,252 --> 00:52:54,711 this matrix multiplication gate, 1502 00:52:54,711 --> 00:52:56,785 you end up mulitplying by the transpose 1503 00:52:56,785 --> 00:52:58,457 of that weight matrix. 1504 00:52:58,457 --> 00:53:00,642 So that means that every time we backpropagate 1505 00:53:00,642 --> 00:53:03,857 through one of these vanilla RNN cells, 1506 00:53:03,857 --> 00:53:08,024 we end up multiplying by some part of the weight matrix. 
1507 00:53:09,110 --> 00:53:11,124 So now if you imagine that we are sticking many 1508 00:53:11,124 --> 00:53:13,834 of these recurrent neural network cells in sequence, 1509 00:53:13,834 --> 00:53:15,413 because again this is an RNN. 1510 00:53:15,413 --> 00:53:16,599 We want a model sequences. 1511 00:53:16,599 --> 00:53:18,976 Now if you imagine what happens to the gradient flow 1512 00:53:18,976 --> 00:53:20,921 through a sequence of these layers, 1513 00:53:20,921 --> 00:53:23,770 then something kind of fishy starts to happen. 1514 00:53:23,770 --> 00:53:25,207 Because now, when we want to compute 1515 00:53:25,207 --> 00:53:27,863 the gradient of the loss with respect to h zero, 1516 00:53:27,863 --> 00:53:29,440 we need to backpropagate through every one 1517 00:53:29,440 --> 00:53:30,810 of these RNN cells. 1518 00:53:30,810 --> 00:53:32,903 And every time you backpropagate through one cell, 1519 00:53:32,903 --> 00:53:35,641 you'll pick up one of these w transpose factors. 1520 00:53:35,641 --> 00:53:38,176 So that means that the final expression 1521 00:53:38,176 --> 00:53:40,289 for the gradient on h zero 1522 00:53:40,289 --> 00:53:42,161 will involve many, many factors 1523 00:53:42,161 --> 00:53:43,550 of this weight matrix, 1524 00:53:43,550 --> 00:53:44,967 which could be kind of bad. 1525 00:53:44,967 --> 00:53:46,976 Maybe don't think about the weight, 1526 00:53:46,976 --> 00:53:48,329 the matrix case, 1527 00:53:48,329 --> 00:53:50,138 but imagine a scaler case. 1528 00:53:50,138 --> 00:53:51,877 If we end up, if we have some scaler 1529 00:53:51,877 --> 00:53:53,662 and we multiply by that same number over and over 1530 00:53:53,662 --> 00:53:54,563 and over again, 1531 00:53:54,563 --> 00:53:56,038 maybe not for four examples, 1532 00:53:56,038 --> 00:53:57,133 but for something like a hundred 1533 00:53:57,133 --> 00:54:00,269 or several hundred time steps, 1534 00:54:00,269 --> 00:54:01,832 then multiplying by the same number 1535 00:54:01,832 --> 00:54:03,835 over and over again is really bad. 1536 00:54:03,835 --> 00:54:05,244 In the scaler case, 1537 00:54:05,244 --> 00:54:06,756 it's either going to explode 1538 00:54:06,756 --> 00:54:08,971 in the case that that number is greater than one 1539 00:54:08,971 --> 00:54:10,504 or it's going to vanish towards zero 1540 00:54:10,504 --> 00:54:12,912 in the case that number is less than one 1541 00:54:12,912 --> 00:54:14,069 in absolute value. 1542 00:54:14,069 --> 00:54:16,355 And the only way in which this will not happen 1543 00:54:16,355 --> 00:54:18,186 is if that number is exactly one, 1544 00:54:18,186 --> 00:54:20,911 which is actually very rare to happen in practice. 1545 00:54:20,911 --> 00:54:22,569 That leaves us to, 1546 00:54:22,569 --> 00:54:25,229 that same intuition extends to the matrix case, 1547 00:54:25,229 --> 00:54:27,560 but now, rather than the absolute value of a scaler number, 1548 00:54:27,560 --> 00:54:29,261 you instead need to look at the largest, 1549 00:54:29,261 --> 00:54:32,071 the largest singular value of this weight matrix. 1550 00:54:32,071 --> 00:54:34,707 Now if that largest singular value is greater than one, 1551 00:54:34,707 --> 00:54:36,502 then during this backward pass, 1552 00:54:36,502 --> 00:54:38,824 when we multiply by the weight matrix over and over, 1553 00:54:38,824 --> 00:54:42,135 that gradient on h w, on h zero, sorry, 1554 00:54:42,135 --> 00:54:43,915 will become very, very large, 1555 00:54:43,915 --> 00:54:46,079 when that matrix is too large. 
1556 00:54:46,079 --> 00:54:48,947 And that's something we call the exploding gradient problem. 1557 00:54:48,947 --> 00:54:52,047 Where now this gradient will explode exponentially in depth 1558 00:54:52,047 --> 00:54:53,427 with the number of time steps 1559 00:54:53,427 --> 00:54:55,061 that we backpropagate through. 1560 00:54:55,061 --> 00:54:57,643 And if the largest singular value is less than one, 1561 00:54:57,643 --> 00:54:59,123 then we get the opposite problem, 1562 00:54:59,123 --> 00:55:00,036 where now our gradients will shrink 1563 00:55:00,036 --> 00:55:01,563 and shrink and shrink exponentially, 1564 00:55:01,563 --> 00:55:03,945 as we backpropagate and pick up more and more factors 1565 00:55:03,945 --> 00:55:05,349 of this weight matrix. 1566 00:55:05,349 --> 00:55:08,863 That's called the vanishing gradient problem. 1567 00:55:08,863 --> 00:55:11,172 THere's a bit of a hack that people sometimes do 1568 00:55:11,172 --> 00:55:12,836 to fix the exploding gradient problem 1569 00:55:12,836 --> 00:55:14,208 called gradient clipping, 1570 00:55:14,208 --> 00:55:16,264 which is just this simple heuristic 1571 00:55:16,264 --> 00:55:18,660 saying that after we compute our gradient, 1572 00:55:18,660 --> 00:55:19,713 if that gradient, 1573 00:55:19,713 --> 00:55:22,156 if it's L2 norm is above some threshold, 1574 00:55:22,156 --> 00:55:24,112 then just clamp it down and divide, 1575 00:55:24,112 --> 00:55:27,955 just clamp it down so it has this maximum threshold. 1576 00:55:27,955 --> 00:55:29,271 This is kind of a nasty hack, 1577 00:55:29,271 --> 00:55:30,964 but it actually gets used in practice quite a lot 1578 00:55:30,964 --> 00:55:32,645 when training recurrent neural networks. 1579 00:55:32,645 --> 00:55:34,517 And it's a relatively useful tool 1580 00:55:34,517 --> 00:55:39,034 for attacking this exploding gradient problem. 1581 00:55:39,034 --> 00:55:40,994 But now for the vanishing gradient problem, 1582 00:55:40,994 --> 00:55:42,254 what we typically do 1583 00:55:42,254 --> 00:55:43,640 is we might need to move to 1584 00:55:43,640 --> 00:55:46,207 a more complicated RNN architecture. 1585 00:55:46,207 --> 00:55:48,939 So that motivates this idea of an LSTM. 1586 00:55:48,939 --> 00:55:53,524 An LSTM, which stands for Long Short Term Memory, 1587 00:55:53,524 --> 00:55:55,700 is this slightly fancier recurrence relation 1588 00:55:55,700 --> 00:55:58,316 for these recurrent neural networks. 1589 00:55:58,316 --> 00:55:59,984 It's really designed to help alleviate 1590 00:55:59,984 --> 00:56:03,330 this problem of vanishing and exploding gradients. 1591 00:56:03,330 --> 00:56:05,591 So that rather than kind of hacking on top of it, 1592 00:56:05,591 --> 00:56:07,794 we just kind of design the architecture 1593 00:56:07,794 --> 00:56:10,193 to have better gradient flow properties. 1594 00:56:10,193 --> 00:56:13,566 Kind of an analogy to those fancier CNN architectures 1595 00:56:13,566 --> 00:56:16,556 that we saw at the top of the lecture. 1596 00:56:16,556 --> 00:56:17,389 Another thing to point out 1597 00:56:17,389 --> 00:56:20,807 is that the LSTM cell actually comes from 1997. 1598 00:56:20,807 --> 00:56:22,337 So this idea of an LSTM 1599 00:56:22,337 --> 00:56:24,073 has been around for quite a while, 1600 00:56:24,073 --> 00:56:26,108 and these folks were working on these ideas 1601 00:56:26,108 --> 00:56:27,110 way back in the 90s, 1602 00:56:27,110 --> 00:56:28,852 were definitely ahead of the curve. 
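A small sketch of the gradient clipping heuristic, plus a toy illustration of why it (or an LSTM) is needed: repeatedly multiplying by the transpose of the recurrence matrix blows the gradient up or shrinks it toward zero, depending on the scale of that matrix (the threshold, sizes, and scaling factors here are arbitrary choices):

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        # If the L2 norm of the gradient exceeds the threshold, rescale it down to the threshold.
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad

    H, T = 32, 100
    W = np.random.randn(H, H) * 1.5 / np.sqrt(H)    # scaled up so repeated multiplication explodes
    g = np.random.randn(H)
    for _ in range(T):
        g = W.T @ g                                 # one W^T factor per time step (tanh ignored here)
    print(np.linalg.norm(g))                        # enormous; shrink W (e.g. by 0.3) and it vanishes instead
    print(np.linalg.norm(clip_gradient(g)))         # clipping caps the norm at 5.0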
1603 00:56:28,852 --> 00:56:31,225 These LSTM models are kind of used everywhere now, 1604 00:56:31,225 --> 00:56:32,475 20 years later. 1605 00:56:33,864 --> 00:56:37,921 And LSTMs kind of have this funny functional form. 1606 00:56:37,921 --> 00:56:39,814 So remember when we had this vanilla 1607 00:56:39,814 --> 00:56:41,198 recurrent neural network, 1608 00:56:41,198 --> 00:56:42,538 it had this hidden state. 1609 00:56:42,538 --> 00:56:44,056 And we used this recurrence relation 1610 00:56:44,056 --> 00:56:46,397 to update the hidden state at every time step. 1611 00:56:46,397 --> 00:56:47,570 Well, now in an LSTM, 1612 00:56:47,570 --> 00:56:48,842 we actually maintain 1613 00:56:48,842 --> 00:56:51,462 two hidden states at every time step. 1614 00:56:51,462 --> 00:56:52,931 One is this h t, 1615 00:56:52,931 --> 00:56:54,439 which is called the hidden state, 1616 00:56:54,439 --> 00:56:57,334 which is kind of analogous to the hidden state 1617 00:56:57,334 --> 00:56:59,305 that we had in the vanilla RNN. 1618 00:56:59,305 --> 00:57:02,524 But an LSTM also maintains a second vector, c t, 1619 00:57:02,524 --> 00:57:03,604 called the cell state. 1620 00:57:03,604 --> 00:57:06,459 And the cell state is this vector which is kind of internal, 1621 00:57:06,459 --> 00:57:08,546 kept inside the LSTM, 1622 00:57:08,546 --> 00:57:12,240 and it does not really get exposed to the outside world. 1623 00:57:12,240 --> 00:57:13,170 And we'll see, 1624 00:57:13,170 --> 00:57:15,260 you can kind of see that through these update equations, 1625 00:57:15,260 --> 00:57:17,371 where you can see that first, 1626 00:57:17,371 --> 00:57:19,235 when we compute these updates, 1627 00:57:19,235 --> 00:57:20,803 we take our two inputs, 1628 00:57:20,803 --> 00:57:23,966 we use them to compute these four gates 1629 00:57:23,966 --> 00:57:25,485 called i, f, o, and g. 1630 00:57:25,485 --> 00:57:28,634 We use those gates to update our cell state, c t, 1631 00:57:28,634 --> 00:57:30,302 and then we expose part of our cell state 1632 00:57:30,302 --> 00:57:33,802 as the hidden state at the next time step. 1633 00:57:36,704 --> 00:57:38,282 This is kind of a funny functional form, 1634 00:57:38,282 --> 00:57:40,227 and I want to walk through, over a couple of slides, 1635 00:57:40,227 --> 00:57:42,407 exactly why we use this architecture 1636 00:57:42,407 --> 00:57:44,014 and why it makes sense, 1637 00:57:44,014 --> 00:57:45,418 especially in the context 1638 00:57:45,418 --> 00:57:47,731 of vanishing or exploding gradients. 1639 00:57:47,731 --> 00:57:51,044 The first thing that we do in an LSTM 1640 00:57:51,044 --> 00:57:54,534 is that we're given this previous hidden state, h t, 1641 00:57:54,534 --> 00:57:57,213 and we're given our current input vector, x t, 1642 00:57:57,213 --> 00:57:58,611 just like in the vanilla RNN. 1643 00:57:58,611 --> 00:58:00,251 In the vanilla RNN, remember, 1644 00:58:00,251 --> 00:58:02,617 we took those two input vectors. 1645 00:58:02,617 --> 00:58:03,908 We concatenated them. 1646 00:58:03,908 --> 00:58:05,098 Then we did a matrix multiply 1647 00:58:05,098 --> 00:58:08,663 to directly compute the next hidden state in the RNN. 1648 00:58:08,663 --> 00:58:10,751 Now, the LSTM does something a little bit different.
1649 00:58:10,751 --> 00:58:13,055 We're going to take our previous hidden state 1650 00:58:13,055 --> 00:58:14,429 and our current input, 1651 00:58:14,429 --> 00:58:15,315 stack them, 1652 00:58:15,315 --> 00:58:18,748 and now multiply by a very big weight matrix, w, 1653 00:58:18,748 --> 00:58:21,926 to compute four different gates, 1654 00:58:21,926 --> 00:58:24,391 which all have the same size as the hidden state. 1655 00:58:24,391 --> 00:58:26,193 Sometimes, you'll see this written in different ways. 1656 00:58:26,193 --> 00:58:29,549 Some authors will write a different weight matrix 1657 00:58:29,549 --> 00:58:30,808 for each gate. 1658 00:58:30,808 --> 00:58:31,793 Some authors will combine them all 1659 00:58:31,793 --> 00:58:33,160 into one big weight matrix. 1660 00:58:33,160 --> 00:58:34,677 But it's all really the same thing. 1661 00:58:34,677 --> 00:58:36,334 The idea is that we take our hidden state 1662 00:58:36,334 --> 00:58:37,282 and our current input, 1663 00:58:37,282 --> 00:58:40,288 and then we use those to compute these four gates. 1664 00:58:40,288 --> 00:58:42,163 These four gates are, 1665 00:58:42,163 --> 00:58:45,725 you often see this written as i, f, o, g, ifog, 1666 00:58:45,725 --> 00:58:47,768 which makes it pretty easy to remember what they are. 1667 00:58:47,768 --> 00:58:49,435 I is the input gate. 1668 00:58:50,324 --> 00:58:53,655 It says how much we want to input into our cell. 1669 00:58:53,655 --> 00:58:55,431 F is the forget gate. 1670 00:58:55,431 --> 00:58:57,860 It says how much we want to forget the cell memory 1671 00:58:57,860 --> 00:59:00,194 from the previous time step. 1672 00:59:00,194 --> 00:59:01,333 O is the output gate, 1673 00:59:01,333 --> 00:59:03,327 which is how much we want to reveal the cell 1674 00:59:03,327 --> 00:59:04,653 to the outside world. 1675 00:59:04,653 --> 00:59:07,092 And G really doesn't have a nice name, 1676 00:59:07,092 --> 00:59:10,130 so I usually call it the gate gate. 1677 00:59:10,130 --> 00:59:13,109 G tells us how much we want to write 1678 00:59:13,109 --> 00:59:14,626 into our cell. 1679 00:59:14,626 --> 00:59:16,665 And then you notice that each of these four gates 1680 00:59:16,665 --> 00:59:19,665 is using a different non linearity. 1681 00:59:21,724 --> 00:59:23,809 The input, forget and output gates 1682 00:59:23,809 --> 00:59:25,047 are all using sigmoids, 1683 00:59:25,047 --> 00:59:28,571 which means that their values will be between zero and one. 1684 00:59:28,571 --> 00:59:31,109 Whereas the gate gate uses a tanh, 1685 00:59:31,109 --> 00:59:34,316 which means its output will be between minus one and one. 1686 00:59:34,316 --> 00:59:36,810 So, these might seem kind of weird, 1687 00:59:36,810 --> 00:59:38,551 but it makes a little bit more sense 1688 00:59:38,551 --> 00:59:41,725 if you imagine them all as binary values. 1689 00:59:41,725 --> 00:59:43,077 Right, like what happens at the extremes 1690 00:59:43,077 --> 00:59:44,744 of these two values? 1691 00:59:46,033 --> 00:59:46,866 You can see what happens, 1692 00:59:46,866 --> 00:59:48,490 after we compute these gates, 1693 00:59:48,490 --> 00:59:50,200 if you look at this next equation: 1694 00:59:50,200 --> 00:59:51,423 our cell state 1695 00:59:51,423 --> 00:59:53,926 is being multiplied element wise by the forget gate.
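As a concrete sketch of that gate computation, stack the previous hidden state with the current input, do one big matrix multiply, then apply the different nonlinearities. The function and variable names below are my own rather than the slide's notation, and x is assumed to have the same size H as the hidden state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(x_t, h_prev, W, b):
    """Compute the four LSTM gates from the stacked input [h_prev, x_t].
    x_t and h_prev each have size H; W has shape (4H, 2H), b has shape (4H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b  # one big matrix multiply
    i = sigmoid(z[0 * H:1 * H])   # input gate: how much to write into the cell
    f = sigmoid(z[1 * H:2 * H])   # forget gate: how much old cell memory to keep
    o = sigmoid(z[2 * H:3 * H])   # output gate: how much of the cell to reveal
    g = np.tanh(z[3 * H:4 * H])   # "gate gate": candidate values in [-1, 1]
    return i, f, o, g
```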
1696 00:59:53,926 --> 00:59:56,269 To be precise, it's the cell state from the previous time step 1697 00:59:56,269 --> 00:59:59,305 that's being multiplied element wise by this forget gate. 1698 00:59:59,305 --> 01:00:00,350 And now for this forget gate, 1699 01:00:00,350 --> 01:00:03,861 you can think of it as being a vector of zeros and ones, 1700 01:00:03,861 --> 01:00:06,162 that's telling us, for each element in the cell state, 1701 01:00:06,162 --> 01:00:08,416 do we want to forget that element of the cell, 1702 01:00:08,416 --> 01:00:10,644 in the case that the forget gate is zero, 1703 01:00:10,644 --> 01:00:12,819 or do we want to remember that element of the cell, 1704 01:00:12,819 --> 01:00:14,935 in the case that the forget gate is one. 1705 01:00:14,935 --> 01:00:17,258 Now, once we've used the forget gate 1706 01:00:17,258 --> 01:00:19,767 to gate off part of the cell state, 1707 01:00:19,767 --> 01:00:21,088 then we have the second term, 1708 01:00:21,088 --> 01:00:24,814 which is the element wise product of i and g. 1709 01:00:24,814 --> 01:00:27,431 So now, i is this vector of zeros and ones, 1710 01:00:27,431 --> 01:00:28,824 because it's coming through a sigmoid, 1711 01:00:28,824 --> 01:00:31,480 telling us, for each element of the cell state, 1712 01:00:31,480 --> 01:00:33,909 do we want to write to that element of the cell state, 1713 01:00:33,909 --> 01:00:35,728 in the case that i is one, 1714 01:00:35,728 --> 01:00:37,771 or do we not want to write to that element of the cell state 1715 01:00:37,771 --> 01:00:38,687 at this time step, 1716 01:00:38,687 --> 01:00:41,719 in the case that i is zero. 1717 01:00:41,719 --> 01:00:42,585 And now the gate gate, 1718 01:00:42,585 --> 01:00:44,254 because it's coming through a tanh, 1719 01:00:44,254 --> 01:00:46,031 will be either one or minus one at the extremes. 1720 01:00:46,031 --> 01:00:47,495 So that is the candidate value 1721 01:00:47,495 --> 01:00:50,500 that we might consider writing 1722 01:00:50,500 --> 01:00:54,022 to each element of the cell state at this time step. 1723 01:00:54,022 --> 01:00:56,098 Then if you look at the cell state equation, 1724 01:00:56,098 --> 01:00:58,157 you can see that at every time step, 1725 01:00:58,157 --> 01:00:59,963 the cell state has these 1726 01:00:59,963 --> 01:01:02,499 different, independent scalar values, 1727 01:01:02,499 --> 01:01:05,230 and they're all being incremented or decremented by up to one. 1728 01:01:05,230 --> 01:01:07,466 So, kind of, 1729 01:01:07,466 --> 01:01:09,560 inside the cell state, we can either remember 1730 01:01:09,560 --> 01:01:11,028 or forget our previous state, 1731 01:01:11,028 --> 01:01:13,480 and then we can either increment or decrement 1732 01:01:13,480 --> 01:01:14,765 each element of that cell state 1733 01:01:14,765 --> 01:01:16,535 by up to one at each time step. 1734 01:01:16,535 --> 01:01:19,430 So you can kind of think of these elements of the cell state 1735 01:01:19,430 --> 01:01:22,189 as being little scalar integer counters 1736 01:01:22,189 --> 01:01:24,061 that can be incremented and decremented 1737 01:01:24,061 --> 01:01:25,708 at each time step. 1738 01:01:25,708 --> 01:01:28,489 And now, after we've computed our cell state, 1739 01:01:28,489 --> 01:01:31,624 then we use our now updated cell state 1740 01:01:31,624 --> 01:01:32,874 to compute a hidden state, 1741 01:01:32,874 --> 01:01:36,513 which we will reveal to the outside world.
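Putting the cell state update together with the hidden state computation that the next few sentences walk through, one full LSTM step might look like the following sketch, reusing numpy and the hypothetical lstm_gates helper from the block above:

```python
def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM forward step (sketch, building on lstm_gates above)."""
    i, f, o, g = lstm_gates(x_t, h_prev, W, b)
    c_t = f * c_prev + i * g    # forget/keep the old cell, then write up to +/- 1
    h_t = o * np.tanh(c_t)      # squash the cell and gate what gets revealed
    return h_t, c_t
```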
1742 01:01:36,513 --> 01:01:38,832 So because this cell state has this interpretation 1743 01:01:38,832 --> 01:01:39,884 of being counters, 1744 01:01:39,884 --> 01:01:41,280 and sort of counting up by one 1745 01:01:41,280 --> 01:01:43,353 or minus one at each time step, 1746 01:01:43,353 --> 01:01:45,518 we want to squash that counter value 1747 01:01:45,518 --> 01:01:48,895 into a nice minus one to one range using a tanh. 1748 01:01:48,895 --> 01:01:50,483 And then we multiply element wise 1749 01:01:50,483 --> 01:01:51,826 by this output gate. 1750 01:01:51,826 --> 01:01:54,441 And the output gate is again coming through a sigmoid, 1751 01:01:54,441 --> 01:01:57,597 so you can think of it as being mostly zeros and ones, 1752 01:01:57,597 --> 01:01:58,947 and the output gate tells us, 1753 01:01:58,947 --> 01:02:00,826 for each element of our cell state, 1754 01:02:00,826 --> 01:02:02,773 do we want to reveal or not reveal 1755 01:02:02,773 --> 01:02:04,614 that element of our cell state 1756 01:02:04,614 --> 01:02:06,994 when we're computing the external hidden state 1757 01:02:06,994 --> 01:02:08,577 for this time step. 1758 01:02:09,736 --> 01:02:11,609 And then, I think there's kind of a tradition 1759 01:02:11,609 --> 01:02:13,479 among people trying to explain LSTMs 1760 01:02:13,479 --> 01:02:14,524 that everyone needs to come up 1761 01:02:14,524 --> 01:02:17,132 with their own potentially confusing LSTM diagram. 1762 01:02:17,132 --> 01:02:18,882 So here's my attempt. 1763 01:02:20,380 --> 01:02:23,866 Here, we can see what's going on inside this LSTM cell. 1764 01:02:23,866 --> 01:02:24,865 We're taking as input, on the left, 1765 01:02:24,865 --> 01:02:27,566 our previous cell state 1766 01:02:27,566 --> 01:02:28,937 and the previous hidden state, 1767 01:02:28,937 --> 01:02:31,266 as well as our current input, x t. 1768 01:02:31,266 --> 01:02:32,537 Now we're going to take 1769 01:02:32,537 --> 01:02:34,985 our previous hidden state, 1770 01:02:34,985 --> 01:02:36,453 as well as our current input, 1771 01:02:36,453 --> 01:02:37,346 stack them, 1772 01:02:37,346 --> 01:02:39,526 and then multiply with this weight matrix, w, 1773 01:02:39,526 --> 01:02:41,166 to produce our four gates. 1774 01:02:41,166 --> 01:02:42,627 And here, I've left out the non linearities 1775 01:02:42,627 --> 01:02:44,836 because we saw those on a previous slide. 1776 01:02:44,836 --> 01:02:47,294 And now the forget gate multiplies element wise 1777 01:02:47,294 --> 01:02:48,143 with the cell state. 1778 01:02:48,143 --> 01:02:51,174 The input and gate gates are multiplied element wise 1779 01:02:51,174 --> 01:02:52,689 and added to the cell state. 1780 01:02:52,689 --> 01:02:54,524 And that gives us our next cell state. 1781 01:02:54,524 --> 01:02:56,533 The next cell state gets squashed through a tanh, 1782 01:02:56,533 --> 01:02:58,616 and multiplied element wise with this output gate 1783 01:02:58,616 --> 01:03:01,366 to produce our next hidden state. 1784 01:03:02,417 --> 01:03:03,250 Question? 1785 01:03:13,116 --> 01:03:14,587 No, they're coming from 1786 01:03:14,587 --> 01:03:17,878 different parts of this weight matrix.
1787 01:03:17,878 --> 01:03:19,147 So if our x and our h 1788 01:03:19,147 --> 01:03:23,343 both have this dimension h, 1789 01:03:23,343 --> 01:03:24,300 then after we stack them, 1790 01:03:24,300 --> 01:03:26,415 we'll have a vector of size two h, 1791 01:03:26,415 --> 01:03:28,718 and now our weight matrix will be this matrix 1792 01:03:28,718 --> 01:03:30,393 of size four h by two h. 1793 01:03:30,393 --> 01:03:31,873 So you can think of that as sort of having 1794 01:03:31,873 --> 01:03:34,076 four chunks of this weight matrix. 1795 01:03:34,076 --> 01:03:37,344 And each of these four chunks of the weight matrix 1796 01:03:37,344 --> 01:03:41,511 is going to compute a different one of these gates. 1797 01:03:42,404 --> 01:03:44,673 You'll often see this written, for clarity, 1798 01:03:44,673 --> 01:03:46,449 kind of combining all four of those different 1799 01:03:46,449 --> 01:03:49,161 weight matrices into a single large matrix, w, 1800 01:03:49,161 --> 01:03:51,109 just for notational convenience. 1801 01:03:51,109 --> 01:03:52,484 But they're all computed 1802 01:03:52,484 --> 01:03:56,067 using different parts of the weight matrix. 1803 01:03:57,080 --> 01:03:58,645 But you're correct in that they're all computed 1804 01:03:58,645 --> 01:04:00,458 using the same functional form 1805 01:04:00,458 --> 01:04:01,658 of just stacking the two things 1806 01:04:01,658 --> 01:04:04,574 and doing a matrix multiplication. 1807 01:04:04,574 --> 01:04:06,393 Now that we have this picture, 1808 01:04:06,393 --> 01:04:09,519 we can think about what happens to an LSTM cell 1809 01:04:09,519 --> 01:04:11,196 during the backwards pass. 1810 01:04:11,196 --> 01:04:12,795 We saw, in the context of the vanilla 1811 01:04:12,795 --> 01:04:13,738 recurrent neural network, 1812 01:04:13,738 --> 01:04:15,803 that some bad things happened during the backwards pass, 1813 01:04:15,803 --> 01:04:16,958 where we were continually multiplying 1814 01:04:16,958 --> 01:04:18,999 by that weight matrix, w. 1815 01:04:18,999 --> 01:04:20,555 But now, the situation looks 1816 01:04:20,555 --> 01:04:23,238 quite a bit different in the LSTM. 1817 01:04:23,238 --> 01:04:26,468 If you imagine this path backwards 1818 01:04:26,468 --> 01:04:28,323 of computing the gradients of the cell state, 1819 01:04:28,323 --> 01:04:29,952 we get quite a nice picture. 1820 01:04:29,952 --> 01:04:32,266 Now, when we have our upstream gradient 1821 01:04:32,266 --> 01:04:33,320 from the cell coming in, 1822 01:04:33,320 --> 01:04:36,281 then once we backpropagate backwards 1823 01:04:36,281 --> 01:04:37,737 through this addition operation, 1824 01:04:37,737 --> 01:04:40,572 remember that this addition just copies 1825 01:04:40,572 --> 01:04:43,663 that upstream gradient into the two branches, 1826 01:04:43,663 --> 01:04:45,641 so our upstream gradient gets copied directly 1827 01:04:45,641 --> 01:04:47,954 and passed directly to backpropagating 1828 01:04:47,954 --> 01:04:50,435 through this element wise multiply. 1829 01:04:50,435 --> 01:04:52,192 So then our upstream gradient ends up getting 1830 01:04:52,192 --> 01:04:56,455 multiplied element wise by the forget gate.
1831 01:04:56,455 --> 01:05:00,364 As we backpropagate backwards through this cell state, 1832 01:05:00,364 --> 01:05:01,397 the only thing that happens 1833 01:05:01,397 --> 01:05:03,818 to our upstream cell state gradient 1834 01:05:03,818 --> 01:05:05,935 is that it ends up getting multiplied element wise 1835 01:05:05,935 --> 01:05:07,171 by the forget gate. 1836 01:05:07,171 --> 01:05:09,939 This is really a lot nicer 1837 01:05:09,939 --> 01:05:12,640 than the vanilla RNN for two reasons. 1838 01:05:12,640 --> 01:05:14,318 One is that this forget gate 1839 01:05:14,318 --> 01:05:16,488 is now an element wise multiplication 1840 01:05:16,488 --> 01:05:18,498 rather than a full matrix multiplication. 1841 01:05:18,498 --> 01:05:19,923 So element wise multiplication 1842 01:05:19,923 --> 01:05:23,205 is going to be a little bit nicer 1843 01:05:23,205 --> 01:05:24,964 than full matrix multiplication. 1844 01:05:24,964 --> 01:05:27,208 Second is that this element wise multiplication 1845 01:05:27,208 --> 01:05:29,710 will potentially be multiplying by a different 1846 01:05:29,710 --> 01:05:31,354 forget gate at every time step. 1847 01:05:31,354 --> 01:05:33,087 So remember, in the vanilla RNN, 1848 01:05:33,087 --> 01:05:35,638 we were continually multiplying by that same weight matrix 1849 01:05:35,638 --> 01:05:36,660 over and over again, 1850 01:05:36,660 --> 01:05:38,305 which led very explicitly 1851 01:05:38,305 --> 01:05:40,563 to these exploding or vanishing gradients. 1852 01:05:40,563 --> 01:05:41,943 But now in the LSTM case, 1853 01:05:41,943 --> 01:05:45,161 this forget gate can vary at each time step. 1854 01:05:45,161 --> 01:05:47,463 Now, it's much easier for the model 1855 01:05:47,463 --> 01:05:49,560 to avoid these problems 1856 01:05:49,560 --> 01:05:51,670 of exploding and vanishing gradients. 1857 01:05:51,670 --> 01:05:53,377 Finally, because this forget gate 1858 01:05:53,377 --> 01:05:54,902 is coming out of a sigmoid, 1859 01:05:54,902 --> 01:05:56,178 this element wise multiply 1860 01:05:56,178 --> 01:05:58,438 is guaranteed to be between zero and one, 1861 01:05:58,438 --> 01:06:00,868 which again leads to sort of nicer numerical properties 1862 01:06:00,868 --> 01:06:02,457 if you imagine multiplying by these things 1863 01:06:02,457 --> 01:06:04,278 over and over again. 1864 01:06:04,278 --> 01:06:07,063 Another thing to notice is that in the context 1865 01:06:07,063 --> 01:06:08,909 of the vanilla recurrent neural network, 1866 01:06:08,909 --> 01:06:10,239 we saw that during the backward pass, 1867 01:06:10,239 --> 01:06:13,146 our gradients were also flowing through a tanh 1868 01:06:13,146 --> 01:06:14,273 at every time step. 1869 01:06:14,273 --> 01:06:15,940 But now, in an LSTM, 1870 01:06:17,459 --> 01:06:18,792 our hidden state is used 1871 01:06:20,790 --> 01:06:22,452 to compute those outputs, y t. 1872 01:06:22,452 --> 01:06:24,110 So now, 1873 01:06:24,110 --> 01:06:26,141 if you imagine backpropagating 1874 01:06:26,141 --> 01:06:29,229 from the final hidden state 1875 01:06:29,229 --> 01:06:31,290 back to the first cell state, 1876 01:06:31,290 --> 01:06:33,024 then through that backward path, 1877 01:06:33,024 --> 01:06:37,396 we only backpropagate through a single tanh non linearity 1878 01:06:37,396 --> 01:06:42,044 rather than through a separate tanh at every time step.
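Here is a minimal sketch of that point: backpropagating the cell state gradient through many time steps only picks up an element wise multiplication by each step's forget gate, and those forget gates can differ from step to step. The names below are mine, not the slide's:

```python
def cell_gradient_through_time(dc_last, forget_gates):
    """Backprop the cell-state gradient from the last time step to the first.
    Each step multiplies element wise by that step's forget gate (values in
    (0, 1)), rather than by the same full weight matrix every time."""
    dc = dc_last
    for f in reversed(forget_gates):  # walk backwards through time
        dc = dc * f
    return dc
```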
1879 01:06:42,044 --> 01:06:44,059 So kind of when you put all these things together, 1880 01:06:44,059 --> 01:06:45,927 you can see that this backwards pass, 1881 01:06:45,927 --> 01:06:48,646 backpropagating through the cell state, 1882 01:06:48,646 --> 01:06:50,483 is kind of a gradient super highway 1883 01:06:50,483 --> 01:06:53,205 that lets gradients pass relatively unimpeded 1884 01:06:53,205 --> 01:06:55,047 from the loss at the very end of the model 1885 01:06:55,047 --> 01:06:56,765 all the way back to the initial cell state 1886 01:06:56,765 --> 01:06:59,172 at the beginning of the model. 1887 01:06:59,172 --> 01:07:00,922 Was there a question? 1888 01:07:02,901 --> 01:07:04,693 Yeah, what about the gradient with respect to w? 1889 01:07:04,693 --> 01:07:06,792 'Cause that's ultimately the thing that we care about. 1890 01:07:06,792 --> 01:07:08,752 So, the gradient with respect to w 1891 01:07:08,752 --> 01:07:10,252 will come through, 1892 01:07:11,737 --> 01:07:12,570 at every time step, 1893 01:07:12,570 --> 01:07:13,771 our current cell state 1894 01:07:13,771 --> 01:07:15,059 as well as our current hidden state, 1895 01:07:15,059 --> 01:07:16,272 and that will give us 1896 01:07:16,272 --> 01:07:18,480 our local gradient on w 1897 01:07:18,480 --> 01:07:19,848 for that time step. 1898 01:07:19,848 --> 01:07:21,340 So, 1899 01:07:21,340 --> 01:07:23,791 just as in the vanilla RNN case, 1900 01:07:23,791 --> 01:07:27,307 we'll end up summing those per time step w gradients 1901 01:07:27,307 --> 01:07:29,587 to compute our final gradient on w. 1902 01:07:29,587 --> 01:07:33,139 But now, if you imagine the situation 1903 01:07:33,139 --> 01:07:34,961 where we have a very long sequence, 1904 01:07:34,961 --> 01:07:36,405 and we're only getting gradients at the very end 1905 01:07:36,405 --> 01:07:37,280 of the sequence, 1906 01:07:37,280 --> 01:07:38,945 then as you backpropagate through, 1907 01:07:38,945 --> 01:07:40,978 we'll get a local gradient on w 1908 01:07:40,978 --> 01:07:43,219 for each time step, 1909 01:07:43,219 --> 01:07:44,484 and that local gradient on w 1910 01:07:44,484 --> 01:07:48,506 will be coming through these gradients on c and h. 1911 01:07:48,506 --> 01:07:50,994 So because we're maintaining the gradients on c 1912 01:07:50,994 --> 01:07:52,751 much more nicely in the LSTM case, 1913 01:07:52,751 --> 01:07:54,988 those local gradients on w at each time step 1914 01:07:54,988 --> 01:07:57,221 will also be carried forward and backward 1915 01:07:57,221 --> 01:07:59,804 through time much more cleanly. 1916 01:08:01,627 --> 01:08:03,044 Another question? 1917 01:08:17,428 --> 01:08:18,645 Yeah, so the question is, 1918 01:08:18,645 --> 01:08:19,886 due to the non linearities, 1919 01:08:19,886 --> 01:08:22,088 could this still be susceptible to vanishing gradients? 1920 01:08:22,089 --> 01:08:24,077 And that could be the case. 1921 01:08:24,077 --> 01:08:26,176 Actually, one problem you might imagine 1922 01:08:26,176 --> 01:08:27,560 is that if these forget gates 1923 01:08:27,560 --> 01:08:29,322 are always somewhat 1924 01:08:29,323 --> 01:08:30,252 less than one, 1925 01:08:30,252 --> 01:08:31,411 you might get vanishing gradients 1926 01:08:31,411 --> 01:08:34,103 as you continually go through these forget gates.
1927 01:08:34,103 --> 01:08:35,960 Well, one sort of trick that people do in practice 1928 01:08:35,960 --> 01:08:38,513 is that they will, sometimes, 1929 01:08:38,513 --> 01:08:40,689 initialize the biases of the forget gate 1930 01:08:40,689 --> 01:08:42,746 to be somewhat positive, 1931 01:08:42,746 --> 01:08:44,004 so that at the beginning of training, 1932 01:08:44,004 --> 01:08:46,305 those forget gates are always very close to one. 1933 01:08:46,305 --> 01:08:48,118 So at least at the beginning of training, 1934 01:08:48,118 --> 01:08:52,285 we have relatively clean gradient flow 1935 01:08:53,265 --> 01:08:54,381 through these forget gates, 1936 01:08:54,381 --> 01:08:56,631 since they're all initialized to be near one. 1937 01:08:56,631 --> 01:08:58,934 And then throughout the course of training, 1938 01:08:58,935 --> 01:09:00,308 the model can learn those biases 1939 01:09:00,308 --> 01:09:02,742 and kind of learn to forget where it needs to. 1940 01:09:02,742 --> 01:09:04,886 You're right that there still could be some potential 1941 01:09:04,886 --> 01:09:06,246 for vanishing gradients here. 1942 01:09:06,246 --> 01:09:07,569 But it's much less extreme 1943 01:09:07,569 --> 01:09:09,182 than in the vanilla RNN case, 1944 01:09:09,182 --> 01:09:12,126 both because those fs can vary at each time step, 1945 01:09:12,126 --> 01:09:14,590 and also because we're doing 1946 01:09:14,590 --> 01:09:15,620 this element wise multiplication 1947 01:09:15,620 --> 01:09:19,126 rather than a full matrix multiplication. 1948 01:09:19,126 --> 01:09:20,801 So you can see that this LSTM 1949 01:09:20,801 --> 01:09:23,048 actually looks quite similar to ResNet. 1950 01:09:23,048 --> 01:09:24,510 In the residual network, 1951 01:09:24,510 --> 01:09:26,732 we had this path of identity connections 1952 01:09:26,732 --> 01:09:28,069 going backward through the network, 1953 01:09:28,069 --> 01:09:30,362 and that gave sort of a gradient super highway 1954 01:09:30,363 --> 01:09:32,303 for gradients to flow backward in ResNet. 1955 01:09:32,303 --> 01:09:34,924 And now it's kind of the same intuition in the LSTM, 1956 01:09:34,924 --> 01:09:37,944 where these additive and element wise 1957 01:09:37,944 --> 01:09:40,226 multiplicative interactions of the cell state 1958 01:09:40,227 --> 01:09:42,305 can give a similar gradient super highway 1959 01:09:42,305 --> 01:09:44,328 for gradients to flow backwards through the cell state 1960 01:09:44,328 --> 01:09:45,245 in an LSTM. 1961 01:09:46,343 --> 01:09:48,593 And by the way, there's this other kind of nice paper 1962 01:09:48,593 --> 01:09:49,682 called highway networks, 1963 01:09:49,682 --> 01:09:51,808 which is kind of in between this idea 1964 01:09:51,808 --> 01:09:53,225 of the LSTM cell 1965 01:09:54,138 --> 01:09:56,471 and these residual networks.
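Going back to that initialization trick for a moment: a minimal sketch, assuming the same [i, f, o, g] bias layout as the earlier lstm_gates sketch, might look like this. The value 1.0 is just a commonly used example; with a bias of 1, the forget gates start near sigmoid(1), roughly 0.73:

```python
import numpy as np

def init_lstm_bias(H, forget_bias=1.0):
    """Initialize the (4H,) LSTM bias with a positive forget-gate chunk so the
    forget gates start close to one and gradients flow cleanly early on."""
    b = np.zeros(4 * H)
    b[H:2 * H] = forget_bias  # the forget-gate slice in the [i, f, o, g] layout
    return b
```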
1966 01:09:57,796 --> 01:09:58,697 Coming back to highway networks: 1967 01:09:58,697 --> 01:10:01,165 they actually came before residual networks, 1968 01:10:01,165 --> 01:10:03,008 and they had this idea where, 1969 01:10:03,008 --> 01:10:04,541 at every layer of the highway network, 1970 01:10:04,541 --> 01:10:05,984 we're going to compute 1971 01:10:05,984 --> 01:10:07,373 sort of a candidate activation, 1972 01:10:07,373 --> 01:10:08,804 as well as a gating function 1973 01:10:08,804 --> 01:10:10,792 that interpolates 1974 01:10:10,792 --> 01:10:13,045 between our previous input at that layer 1975 01:10:13,045 --> 01:10:14,676 and that candidate activation 1976 01:10:14,676 --> 01:10:17,017 that came through our convolutions or whatnot. 1977 01:10:17,017 --> 01:10:19,455 So there's actually a lot of architectural similarities 1978 01:10:19,455 --> 01:10:20,472 between these things, 1979 01:10:20,472 --> 01:10:21,821 and people take a lot of inspiration 1980 01:10:21,821 --> 01:10:24,363 from training very deep CNNs 1981 01:10:24,363 --> 01:10:25,196 and very deep RNNs, 1982 01:10:25,196 --> 01:10:27,688 and there's a lot of crossover here. 1983 01:10:27,688 --> 01:10:31,682 Very briefly, you'll see a lot of other variants 1984 01:10:31,682 --> 01:10:33,777 of recurrent neural network architectures out there 1985 01:10:33,777 --> 01:10:34,675 in the wild. 1986 01:10:34,675 --> 01:10:36,760 Probably the most common, apart from the LSTM, 1987 01:10:36,760 --> 01:10:40,853 is the GRU, the gated recurrent unit. 1988 01:10:40,853 --> 01:10:42,665 And you can see its update equations here, 1989 01:10:42,665 --> 01:10:45,701 and it kind of has this similar flavor to the LSTM, 1990 01:10:45,701 --> 01:10:49,318 where it uses these multiplicative element wise gates 1991 01:10:49,318 --> 01:10:51,417 together with these additive interactions 1992 01:10:51,417 --> 01:10:53,828 to avoid this vanishing gradient problem. 1993 01:10:53,828 --> 01:10:55,584 There's also this cool paper 1994 01:10:55,584 --> 01:10:57,679 called LSTM: A Search Space Odyssey, 1995 01:10:57,679 --> 01:10:59,734 very inventive title, 1996 01:10:59,734 --> 01:11:02,853 where they tried to play around with the LSTM equations 1997 01:11:02,853 --> 01:11:04,980 and swap out the non linearities, 1998 01:11:04,980 --> 01:11:06,329 like, do we really need that tanh 1999 01:11:06,329 --> 01:11:07,343 when exposing the cell through the output gate, 2000 01:11:07,343 --> 01:11:09,756 and they tried to answer a lot of these different questions 2001 01:11:09,756 --> 01:11:11,367 about each of those non linearities, 2002 01:11:11,367 --> 01:11:14,017 each of those pieces of the LSTM update equations: 2003 01:11:14,017 --> 01:11:15,374 what happens if we change the model 2004 01:11:15,374 --> 01:11:18,068 and tweak those LSTM equations a little bit? 2005 01:11:18,068 --> 01:11:19,038 And kind of the conclusion is 2006 01:11:19,038 --> 01:11:20,260 that they all work about the same. 2007 01:11:20,260 --> 01:11:22,414 Some of them work a little bit better than others 2008 01:11:22,414 --> 01:11:24,521 for one problem or another. 2009 01:11:24,521 --> 01:11:25,735 But generally, none of the things, 2010 01:11:25,735 --> 01:11:28,036 none of the tweaks of the LSTM that they tried, 2011 01:11:28,036 --> 01:11:30,911 were significantly better than the original LSTM 2012 01:11:30,911 --> 01:11:32,148 for all problems.
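Before moving on, here is a minimal sketch of the GRU update mentioned above, written in one common formulation; the weight and gate names are mine, and the exact notation on the slide may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step: an update gate z and reset gate r (both sigmoids), plus
    an additive interpolation in place of the LSTM's separate cell state."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx + bz)  # update gate
    r = sigmoid(Wr @ hx + br)  # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate
    return (1 - z) * h_prev + z * h_tilde  # interpolate old and new state
```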
2013 01:11:32,148 --> 01:11:33,647 So that result gives you a little bit more faith 2014 01:11:33,647 --> 01:11:36,505 that even though the LSTM update equations seem kind of magical, 2015 01:11:36,505 --> 01:11:38,889 they're a reasonable thing to use anyway, 2016 01:11:38,889 --> 01:11:41,084 and you should probably consider them for your problem. 2017 01:11:41,084 --> 01:11:43,692 There's also this cool paper from Google a couple years ago 2018 01:11:43,692 --> 01:11:44,575 where they did 2019 01:11:44,575 --> 01:11:46,520 kind of an evolutionary search 2020 01:11:46,520 --> 01:11:48,117 over a very large number 2021 01:11:48,117 --> 01:11:52,605 of random RNN architectures. 2022 01:11:52,605 --> 01:11:55,694 They kind of randomly permute these update equations 2023 01:11:55,694 --> 01:11:57,936 and try putting the additions and the multiplications 2024 01:11:57,936 --> 01:11:59,332 and the gates and the non linearities 2025 01:11:59,332 --> 01:12:00,860 in different kinds of combinations. 2026 01:12:00,860 --> 01:12:03,288 They blasted this out over their huge Google cluster 2027 01:12:03,288 --> 01:12:04,351 and just tried a whole bunch 2028 01:12:04,351 --> 01:12:08,570 of these different update equations in various flavors. 2029 01:12:08,570 --> 01:12:09,742 And again, it was the same story: 2030 01:12:09,742 --> 01:12:11,299 they didn't really find anything 2031 01:12:11,299 --> 01:12:12,607 that was significantly better 2032 01:12:12,607 --> 01:12:15,446 than these existing GRU or LSTM styles, 2033 01:12:15,446 --> 01:12:16,978 although there were some variations that worked 2034 01:12:16,978 --> 01:12:19,518 maybe slightly better or worse for certain problems. 2035 01:12:19,518 --> 01:12:21,304 But kind of the take away is that, 2036 01:12:21,304 --> 01:12:24,252 when using an LSTM or GRU, 2037 01:12:24,252 --> 01:12:27,080 there's probably not so much magic in those exact equations, 2038 01:12:27,080 --> 01:12:29,292 but this idea of managing gradient flow properly 2039 01:12:29,292 --> 01:12:30,633 through these additive connections 2040 01:12:30,633 --> 01:12:32,023 and these multiplicative gates 2041 01:12:32,023 --> 01:12:33,356 is super useful. 2042 01:12:34,888 --> 01:12:37,286 So yeah, the summary is that RNNs are super cool. 2043 01:12:37,286 --> 01:12:40,103 They can allow you to attack tons of new types of problems. 2044 01:12:40,103 --> 01:12:42,024 They are sometimes susceptible to vanishing 2045 01:12:42,024 --> 01:12:43,431 or exploding gradients. 2046 01:12:43,431 --> 01:12:44,740 But we can address that with gradient clipping 2047 01:12:44,740 --> 01:12:47,412 and with fancier architectures. 2048 01:12:47,412 --> 01:12:48,899 And there's a lot of cool overlap 2049 01:12:48,899 --> 01:12:52,043 between CNN architectures and RNN architectures. 2050 01:12:52,043 --> 01:12:53,856 So next time, you'll be taking the midterm. 2051 01:12:53,856 --> 01:12:57,856 But after that, we'll have a, sorry, a question? 2052 01:12:59,525 --> 01:13:00,801 The midterm is after this lecture, 2053 01:13:00,801 --> 01:13:04,142 so anything up to this point is fair game.